
Don't retype struct as primitive types in import. #33225

Merged — 5 commits, merged into dotnet:master on May 5, 2020

Conversation

@sandreenko (Contributor) commented Mar 5, 2020:

This change adds COMPlus_JitNoStructRetyping, which prevents struct retyping for struct-returning calls and struct returns. Currently this retyping happens in the importer; we want to move it to lowering.
The current retyping prevents later phases from optimizing these values; for example, it affects code for inlined methods:

struct nativeSizeStruct
{
  int a;
  int b;
}
nativeSizeStruct foo();

If we inline foo(), we won't be able to promote or enregister the fields of nativeSizeStruct, because the importer would access it as LCL_FIELD long and set doNotEnreg in impFixupCallStructReturn.
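
A minimal C# sketch of that scenario (the caller, Foo, and Sum are hypothetical, not from the PR):

    struct NativeSizeStruct
    {
        public int a;
        public int b;
    }

    static NativeSizeStruct Foo() => new NativeSizeStruct { a = 1, b = 2 };

    static int Sum()
    {
        // After Foo() is inlined, the old importer retyping accesses the
        // result as a single LCL_FIELD long and marks the local doNotEnreg,
        // so fields a and b stay in memory; without retyping they can be
        // promoted and enregistered.
        NativeSizeStruct s = Foo();
        return s.a + s.b;
    }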

Why can't we do that right after inlining? Because other optimizations also need to know the real types; for example, CSE and VN can't propagate a sequence like this:

ASG(nativeSizeStruct A, nativeSizeStruct B);
return LCL_FIELD long A;

but they can if we don't retype.
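
A rough C# analogue of that IR shape (hypothetical code, for illustration only):

    static NativeSizeStruct Identity(NativeSizeStruct b)
    {
        NativeSizeStruct a = b; // ASG(A, B)
        return a;               // old: retyped to "RETURN LCL_FLD long A",
                                // which VN/CSE cannot propagate through;
                                // kept as a struct, VN can propagate b
                                // into the return
    }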

Notes:
Note 1: right after inlining is the right time to do retyping for methods that use a return buffer; in this PR that part is left in the importer.

Note 2: this change doesn't fix retyping in impNormStructVal. For cases like
methodWithStructArgs(foo(), foo()) we currently always create a local var for each struct argument and retype it as a primitive type. That will be fixed in a separate PR.
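
A hedged sketch of the call shape in question (the names are illustrative, reusing the types from the sketch above):

    static void MethodWithStructArgs(NativeSizeStruct x, NativeSizeStruct y)
    {
    }

    static void Caller()
    {
        // Today the importer spills each struct argument to a temp local
        // and retypes the temp as a primitive in impNormStructVal.
        MethodWithStructArgs(Foo(), Foo());
    }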

Note 3: this change doesn't fix the retyping done for limited struct promotion. For structs like:

struct StructWithOneField
{
    int a;
}

struct PromotedStruct
{
    StructWithOneField a; // we change the type of this field from struct to int,
                          // because we don't have recursive struct promotion
    int b;
}

we still retype, but this change gets us closer to fixing that as well.

So the phase in which we should retype struct types into ABI-specific types is lowering. See details in https://github.com/dotnet/runtime/blob/master/docs/design/features/first-class-structs.md and #1231.

The PR is structured to produce no diffs when COMPlus_JitNoStructRetyping=0; that helps catch unwanted side effects and makes merging safer. I have checked that there are no diffs on all Windows platforms using altjit and framework assemblies. With the flag enabled, the change was tested on SPMI collections, crossgen of the framework libraries, and in a Pri1 test run.

The overall design is simple: keep structs as structs until lowering, then retype returns and calls there, inserting BITCAST back to the struct type to keep the IR correct. Then teach the following phases (LSRA, codegen) to work with the new struct nodes.
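
A before-and-after sketch of what lowering does for a single-register struct return, in the JIT's dump notation (a simplified illustration, not actual dump output):

    ; before lowering: struct types are preserved
    *  RETURN    struct
    \--*  CALL      struct foo

    ; after lowering: the call is retyped to its ABI return-register type,
    ; and a BITCAST restores the struct type so the IR stays consistent
    *  RETURN    struct
    \--*  BITCAST   struct
       \--*  CALL      long foo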

The new cases of struct nodes force us to have a struct handle available on all struct trees (except the right side of ASG), so gtGetStructHandleIfPresent becomes more important.

Some initial diffs. For the benchmark that motivated this change:

| Benchmark | Toolchain | Mean | Error | StdDev | Median | Min | Max | Ratio | Conclusion | Gen 0 | Gen 1 | Gen 2 | Allocated |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ScalarFloatSinglethreadADT | \Core_Root_base\CoreRun.exe | 4.657 s | 0.0317 s | 0.0296 s | 4.650 s | 4.618 s | 4.710 s | 1.00 | Base | - | - | - | 272 B |
| ScalarFloatSinglethreadADT | \Core_Root_diff\CoreRun.exe | 1.157 s | 0.0083 s | 0.0078 s | 1.157 s | 1.145 s | 1.173 s | 0.25 | Faster | - | - | - | 272 B |

Some other benchmarks also improve, though less significantly:

| Benchmark | Toolchain | Mean | Error | StdDev | Median | Min | Max | Ratio | Conclusion | RatioSD | Gen 0 | Gen 1 | Gen 2 | Allocated |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| RegexRedux_1 | \Core_Root_base\CoreRun.exe | 603.4 ms | 96.37 ms | 110.97 ms | 612.5 ms | 424.6 ms | 762.8 ms | 1.00 | Base | 0.00 | - | - | - | 2.83 MB |
| RegexRedux_1 | \Core_Root_diff\CoreRun.exe | 431.6 ms | 7.89 ms | 6.99 ms | 432.3 ms | 418.2 ms | 443.7 ms | 0.76 | Faster | 0.17 | - | - | - | 2.83 MB |

Code size changes for my small StructABI\structreturn.cs test show improvements when we inline the constructor method.

Overall, right now, it is a regression; I will start fixing the regressions in the next change. I may push those fixes to this PR, or merge this PR with the flag disabled and fix the regressions in a follow-up.

@sandreenko added the os-windows, arch-x64, and area-CodeGen-coreclr labels on Mar 5, 2020
@sandreenko (Author) commented:

PTAL @CarolEidt @dotnet/jit-contrib, I think this is ready for the first round of review.

@CarolEidt (Contributor) left a comment:

Overall looks good.
It's awesome to remove the retyping from the front-end, but I think that in Lowering when we have something that's returned in a single register, we should go ahead and retype. I think it would simplify things and would not lose information that the backend would need.

BlockRange().InsertAfter(call, bitcast);
callUse.ReplaceWith(comp, bitcast);
}
}
@CarolEidt (Contributor) commented:

I would like to see us, perhaps in future, use BITCAST only where we actually require that the bits get moved to a different register file. We should be able to handle those cases much like the way multireg returns are handled. In fact, it's not clear why we need separate handling for that.

@sandreenko (Author) commented:

I think this is ready for the second round, maybe except codegenxarch.cpp.

No diffs when JitAllowStructRetyping=1 (the default); when the feature is enabled, it is a regression:

Crossgen CodeSize Diffs for System.Private.CoreLib.dll, framework assemblies for  default jit
Summary of Code Size diffs:
(Lower is better)
Total bytes of diff: 79646 (0.24% of base)

but the biggest part of it is due to ASG(LCL_VAR struct with 1 field, call) cases, which are part of #34105; without those regressions it is a good improvement.

I have checked SPMI/crossgen/pmi locally. I will kick off the appropriate CI testing tonight.

CORINFO_CLASS_HANDLE valStructHnd = gtGetStructHandleIfPresent(val);
if (varTypeIsStruct(varDsc) && (valStructHnd == NO_CLASS_HANDLE) && !varTypeIsSIMD(valTyp))
{
// That is a very special case when we have lost classHandle from a LCL_FIELD node
@sandreenko (Author) commented:

I am not happy with this solution, but the cost of having fieldSeq carry information about overlapping fields was too high: in throughput (checking the new flag on each access), in memory consumption (24 bytes instead of 16), and in implementation cost (there are dozens of places where we check for NotAField, many of them very old, and it was unclear how they should work with the new type).
Also, overlapping fields are not common, so it felt like I was spending too much on a rare case. The frameworks showed -12 bytes of improvement from my fix, which was an awful regression in TP and memory.
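
For reference, overlapping fields come from explicit layout; a minimal hypothetical example (not the PR's actual test):

    using System.Runtime.InteropServices;

    [StructLayout(LayoutKind.Explicit)]
    struct Overlapping
    {
        [FieldOffset(0)] public int AsInt;     // at offset 0
        [FieldOffset(0)] public float AsFloat; // overlaps AsInt
    }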


static int Main(string[] args)
{
TestClass(2);
@sandreenko (Author) commented:

The test uses a class and a struct to implement the same logic; guess which case we optimize better. I should probably open an issue about that, but again, overlapping fields are rare.

@sandreenko (Author) commented Apr 6, 2020:

I would like to see us, perhaps in future, use BITCAST only where we actually require that the bits get moved to a different register file. We should be able to handle those cases much like the way multireg returns are handled. In fact, it's not clear why we need separate handling for that.

Hm, I think we are thinking about two different possible representations. With these changes we have the following new struct nodes in lowering:
RETURN struct,
IND struct,
CALL struct,
LCL_VAR/LCL_FLD and similar;
they can be combined in patterns like:
RETURN struct(IND struct(ADDR ref(LCL_VAR struct))),
RETURN struct(CALL struct),
RETURN struct(SIMD SIMD8),
STORE_OBJ(byref address, CALL struct).

I want to keep this struct representation visible even after lowering, so for STORE_OBJ(LCL_VAR struct, CALL struct) we produce STORE_OBJ(byref address, BITCAST<struct>(CALL long)). The advantage is that we can see where we have structs and where we don't; BITCAST nodes do moves when the use and the def need different registers (as in CALL(CALL struct)) and produce nothing when they don't.

Another approach would be to produce the old LIR after lowering: replace all such STORE_OBJ nodes with native-typed STORE_IND, retype LCL_VAR and CALL to native types, and try to avoid any changes in codegen or LSRA. We would then need BITCAST nodes only for SIMD and LCL_FLD nodes.
I remember having issues with STORE_IND byref, but those could be fixed. Another issue would be the STORE_OBJ destination: it is an address, but it knows the type being stored, and that type can be hidden deep in the dst tree. Finding it and changing it to a native type in lowering would be expensive, while leaving it as-is would create strange trees like STORE_IND int(ADDR ref(LCL_VAR struct), CALL int), where the dst type != the src type.

@sandreenko (Author) commented:

ping @dotnet/jit-contrib

@CarolEidt (Contributor) left a comment:

I'm still just a bit apprehensive about using GT_BITCAST for structs, but I think the idea is growing on me.
I'd like to better understand why the elimination of retyping is tied to FEATURE_MULTIREG_RET, but otherwise it looks good.

@@ -430,6 +430,14 @@ CONFIG_INTEGER(JitSaveFpLrWithCalleeSavedRegisters, W("JitSaveFpLrWithCalleeSave
#endif // defined(TARGET_ARM64)
#endif // DEBUG

#if !FEATURE_MULTIREG_RET
CONFIG_INTEGER(JitAllowStructRetyping, W("JitAllowStructRetyping"), 0) // Allow Jit to retype structs as primitive types
@CarolEidt (Contributor) commented:

This will disable this for x64/windows and not for any other targets - is that due to the regressions you described in your PR comments? I'm not sure why this would be tied to FEATURE_MULTIREG_RET.
Also, sorry for the focus on naming - but I think it might be good to give this a name that more clearly reflects that it's the "old" or "bad" way. "Allow" sounds too nice. I always described this as "lying about the types" but maybe that's just a bit too negative. Maybe "JitDoOldStructRetyping"? Or is that too verbose and/or negative?

@sandreenko sandreenko marked this pull request as ready for review April 30, 2020 02:17
@tannergooding (Member) commented:

How will this impact GT_HWINTRINSIC and GT_SIMD which currently always lose the struct handle and retype to TYP_SIMD8, TYP_SIMD12, TYP_SIMD16, or TYP_SIMD32 + a base type?
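
For context, a minimal sketch of the kind of method in question, using standard System.Runtime.Intrinsics APIs (the method itself is hypothetical):

    using System.Runtime.Intrinsics;
    using System.Runtime.Intrinsics.X86;

    static class SimdSample
    {
        // The JIT types this return value as TYP_SIMD16 with base type
        // float, rather than TYP_STRUCT with a Vector128<float> handle.
        static Vector128<float> AddOne(Vector128<float> v)
        {
            return Sse.Add(v, Vector128.Create(1.0f));
        }
    }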

@CarolEidt (Contributor) commented:

How will this impact GT_HWINTRINSIC and GT_SIMD which currently always lose the struct handle and retype to TYP_SIMD8, TYP_SIMD12, TYP_SIMD16, or TYP_SIMD32 + a base type?

I'll let @sandreenko reply to the specific question of how this work impacts those types (I think it's largely agnostic to those types). In future, however, now that we have a ClassLayout on GenTreeJitIntrinsic we should be able to retain the correct handle (and not have to look them up all over the place).

@AndyAyersMS (Member) commented:

when the feature is enabled then it is a regression ... but the biggest part of it due to ASG(LCL_VAR struct with 1 field, call) cases, that are part of #34105, without these regressions it is a good improvement.

Can you say more about what the CS/CQ of this looks like once #34105 is fixed?

@sandreenko (Author) commented:

How will this impact GT_HWINTRINSIC and GT_SIMD which currently always lose the struct handle and retype to TYP_SIMD8, TYP_SIMD12, TYP_SIMD16, or TYP_SIMD32 + a base type?

I'll let @sandreenko reply to the specific question of how this work impacts those types (I think it's largely agnostic to those types). In future, however, now that we have a ClassLayout on GenTreeJitIntrinsic we should be able to retain the correct handle (and not have to look them up all over the place).

The change is agnostic to GT_HWINTRINSIC; it touches TYP_SIMD* a bit, since those types account for the difference between the varTypeIsStruct(type) and type == TYP_STRUCT checks. As Carol said, this is a move towards using the same logic for everything that satisfies varTypeIsStruct(type), as you can see in Lowering::LowerRet(GenTreeUnOp* ret) and some other places. In future changes we will delete more of the special handling for TYP_SIMD*.

@sandreenko (Author) commented:

Ok, the tests are green now; there were a few new failures due to the tail call changes.

The PR is ready for review.

@AndyAyersMS of course, I will repeat the analysis of the diffs and post the results here.

@CarolEidt (Contributor) left a comment:

Mostly comments and some non-blocking suggestions or questions.
Looks good!

@@ -437,6 +437,14 @@ CONFIG_INTEGER(JitSaveFpLrWithCalleeSavedRegisters, W("JitSaveFpLrWithCalleeSave
#endif // defined(TARGET_ARM64)
#endif // DEBUG

#if !FEATURE_MULTIREG_RET
CONFIG_INTEGER(JitDoOldStructRetyping, W("JitDoOldStructRetyping"), 0) // Allow Jit to retype structs as primitive types
@CarolEidt (Contributor) commented:

I like the new name - thanks :-)

@@ -17277,6 +17309,9 @@ CORINFO_CLASS_HANDLE Compiler::gtGetStructHandleIfPresent(GenTree* tree)
#endif
break;
}
// TODO-1stClassStructs: add a check that `structHnd != NO_CLASS_HANDLE`,
@CarolEidt (Contributor) commented:

Thanks for adding this TODO

// Arguments:
// call - The call node to lower.
//
// Note: it transforms the call's user.
@CarolEidt (Contributor) commented:

It is a bit unfortunate that we have to transform the call's user. I wonder whether it would be feasible and reasonable to handle these when lowering the user. I think this is fine for now.

@sandreenko (Author) replied:

You are right, we had agreed to do this in the user's lowering, but when I did so I saw that:

  1. we have to process the call in LowerCall when it is unused, so compilation time is spent finding a user in any case;
  2. the processing goes into two more functions (in addition to LowerCall), LowerBlockStore and LowerStoreIndir, which are platform-specific, so the logic to find these cases would have to be duplicated for different platforms;
  3. it takes more code and operations to find these cases from the user, because each struct call has a user that should be modified, but not all struct stores have such calls as their sources.

So after a while I changed it back to process everything in one place; I think it is more visible that way.

@sandreenko (Author) commented:

@AndyAyersMS here are examples of some improvements (I will write a complete report when we enable this by default).

The first type, -580 (-40.93% of base): Microsoft.CodeAnalysis.CSharp.dasm - BinopEasyOut:TypeToIndex(TypeSymbol):Nullable`1

This method returns a struct<8> { 0x0 bool, 0x4 int } and has 30 return statements, so before the change we retyped RETURN struct(LCL_VAR struct V01) to RETURN long(LCL_FLD long V01) for all of them. That forbade independent struct promotion of all the involved LCL_VARs. Now we keep them as structs and create a merged return LCL_VAR, but we copy results from LCL_VARs that are promoted independently, so instead of this:

N003 (  3,  4) [000707] -A------R---              *  ASG       long  
N002 (  1,  1) [000706] D------N----              +--*  LCL_VAR   long   V35 tmp32        
N001 (  3,  4) [000078] ------------              \--*  LCL_FLD   long   V11 tmp8         [+0]
                                                  \--*    bool   V11.hasValue (offs=0x00) -> V52 tmp49        
                                                  \--*    int    V11.value (offs=0x04) -> V53 tmp50  

where optimizations of both V35 and V11 are blocked, we have:

N007 ( 16, 12) [000763] -A----------              *  COMMA     void  
N003 (  9,  7) [000759] -A------R---              +--*  ASG       bool  
N002 (  4,  3) [000757] D------N----              |  +--*  LCL_VAR   bool   V94 tmp91        
N001 (  4,  3) [000758] -------N----              |  \--*  LCL_VAR   bool   V52 tmp49        
N006 (  7,  5) [000762] -A------R---              \--*  ASG       int   
N005 (  3,  2) [000760] D------N----                 +--*  LCL_VAR   int    V95 tmp92        
N004 (  3,  2) [000761] -------N----                 \--*  LCL_VAR   int    V53 tmp50

Then we apply CSE and propagation optimizations that benefit from bool lclVars and propagate those field assignments well. As a result, we get 58 instances of "Removing tree [TREE_ID] in [BB_ID] as useless" (0 in the base) and allocate the rest to registers instead of memory.
ASM before:

IN0011: 000054 mov      byte  ptr [V52 rsp+C0H], 0
IN0012: 00005C xor      eax, eax
IN0013: 00005E mov      dword ptr [V53 rsp+C4H], eax
IN0014: 000065 mov      dword ptr [V53 rsp+C4H], 6
IN0015: 000070 mov      byte  ptr [V52 rsp+C0H], 1
IN0016: 000078 mov      rax, qword ptr [V11 rsp+C0H]

ASM after:

IN0011: 000043 mov      eax, 6
IN0012: 000048 mov      ecx, 1
IN0013: 00004D mov      byte  ptr [V94 rsp+20H], cl                                           
IN0014: 000051 mov      dword ptr [V95 rsp+24H], eax

The second type, -203 (-20.86% of base): Newtonsoft.Json.dasm - DefaultContractResolver:CreatePropertyFromConstructorParameter(JsonProperty,ParameterInfo):JsonProperty:this

This was the initial motivation for this work (it came from the SIMD bench).

We are inlining a method that returns a small struct:

    [000149] -AC---------              *  ASG       struct (copy)                                                   
    [000147] D------N----              +--*  LCL_VAR   struct<System.Nullable`1[Boolean], 2> V06 loc3         
    [000146] --C---------              \--*  RET_EXPR  struct(inl return from call [000145])

and before we were transforming it to:

    [000150] -AC---------              *  ASG       short 
    [000149] ------------              +--*  IND       short 
    [000148] ------------              |  \--*  ADDR      byref 
    [000147] -------N----              |     \--*  LCL_VAR   struct<System.Nullable`1[Boolean], 2> V06 loc3         
    [000146] --C---------              \--*  RET_EXPR  int   (inl return from call [000145])

That was blocking V06 struct promotion and enregistration; now it does not happen.
So when we copy these values, instead of two moves (from a memory location to a register, and from the register to another memory location) we have one reg-to-reg move:

movzx    rax, byte  ptr [V54 rsp+90H]
mov      byte  ptr [V68 rsp+60H], al

after:

movzx    rcx, dl

The third type, -13 (-81.25% of base): System.Private.CoreLib.dasm - ValueTuple:Create():ValueTuple

This is a funny one. For IL like:

Importing BB01 (PC=000) of 'System.ValueTuple:Create():System.ValueTuple'
    [ 0]   0 (0x000) ldloca.s 0
    [ 1]   2 (0x002) initobj 020001C9

we were generating

***** BB01
STMT00000 (IL 0x000...0x003)
               [000003] IA----------              *  ASG       struct (init)
               [000000] D------N----              +--*  LCL_VAR   struct<System.ValueTuple, 1> V00 loc0         
               [000002] ------------              \--*  CNS_INT   int    0

***** BB01
STMT00001 (IL 0x008...0x009)
               [000005] ------------              *  RETURN    int   
               [000004] ------------              \--*  LCL_FLD   byte   V00 loc0         [+0]

now it is

N002 (  2,  2) [000005] ------------              *  RETURN    struct
N001 (  1,  1) [000004] ------------              \--*  CNS_INT   int    0

(copy propagation for structs was unblocked in one of the preparation PRs).
so instead of

IN0004: 000000 push     rax
IN0001: 000001 xor      eax, eax
IN0002: 000003 mov      byte  ptr [V00 rsp], al
IN0003: 000006 movsx    rax, byte  ptr [V00 rsp]
IN0005: 00000B add      rsp, 8
IN0006: 00000F ret

we now have

IN0001: 000000 xor      eax, eax
IN0002: 000002 ret

when we create a ValueTuple (I wonder if it is popular, like std::map<T,S> is in C++).
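
For reference, the method is tiny; a C# reconstruction from the IL above (treat the exact source as an assumption):

    public static System.ValueTuple Create()
    {
        // ldloca.s 0 + initobj: zero-initialize a local and return it;
        // after the change this folds to "xor eax, eax; ret" as shown above.
        System.ValueTuple t = default(System.ValueTuple);
        return t;
    }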

After #34105 and #11413 are fixed, we will probably see some other patterns as well in place of the current regressions.

There are 2 issues that prevent this from being enabled by default; they are causing significant asm regressions.
@AndyAyersMS (Member) commented:

examples of some improvements

@sandreenko looks good. I would expect Nullable<int> to be one of the things that really benefits here. You might want to explore adding this as an instantiation type in PMI to see more broadly what happens. You should also see the Range type (a struct of two ints) benefit; see notes over in #11848.

@sandreenko (Author) commented:

@sandreenko looks good. I would expect Nullable<int> to be one of the things that really benefits here. You might want to explore adding this as an instantiation type in PMI to see more broadly what happens. You should also see the Range type (a struct of two ints) benefit; see notes over in #11848.

Yes, all returns and calls that return structs in a register should benefit.

I added Type[] typesToTry = { typeof(int?) } and got an additional 216 improved methods. The full logs are attached.
I looked at a few regressed methods with int?; they are in the expected form of ASG(LCL_VAR struct with 1 field, call struct), where we now block independent promotion of the struct.
pmiDiffsWithNullInt.txt
crossgenDiffs.txt
pmiDiffs.txt

@sandreenko sandreenko merged commit 5da855d into dotnet:master May 5, 2020
@ghost ghost locked as resolved and limited conversation to collaborators Dec 10, 2020