Optimization to remove redundant zero initializations. #36918

Merged
merged 2 commits into from
May 26, 2020

erozenfeld
Member

This change adds a phase that iterates over basic blocks starting with the first
basic block until there is no unique basic block successor or until it detects a
loop. It keeps track of local nodes it encounters. When it gets to an assignment
to a local variable or a local field, it checks whether the assignment is the
first reference to the local (or to the parent of the local field), and, if so,
it may do one of two optimizations:

  1. If the local is untracked, the rhs of the assignment is 0, and the local is
    guaranteed to be fully initialized in the prolog, the explicit zero
    initialization is removed.
  2. If the assignment is to a local (and not a field) and either the local has no
    gc pointers or there are no gc-safe points between the prolog and the
    assignment, it marks the local with lvHasExplicitInit which tells the codegen
    not to insert zero initialization for this local in the prolog.

This addresses one of the examples in #2325 and 5 examples in #1007.
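
The walk described above can be sketched roughly as follows. This is a toy model, not the JIT's implementation: `ToyStmt`, `ToyBlock`, and `findFirstRefZeroInits` are hypothetical names, loop detection is approximated by stopping when a block is revisited, and every local is assumed to be untracked and fully zeroed in the prolog.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_set>
#include <utility>
#include <vector>

// Toy stand-ins for JIT data structures; every name here is hypothetical.
struct ToyStmt
{
    int  lclNum;     // local referenced by this statement
    bool isZeroInit; // true for "local = 0", false for any other reference
};

struct ToyBlock
{
    std::vector<ToyStmt> stmts;
    int                  uniqueSucc; // index of the single successor, or -1 if none or several
};

// Walks from block 0 along unique successors and returns (block, statement)
// index pairs for zero assignments that are the first reference to their local.
// Assumption: all locals are untracked and fully zeroed in the prolog, so each
// returned assignment is a candidate for removal.
inline std::vector<std::pair<int, int>> findFirstRefZeroInits(const std::vector<ToyBlock>& blocks)
{
    std::vector<std::pair<int, int>> candidates;
    std::unordered_set<int>          seenLocals;
    std::vector<bool>                visited(blocks.size(), false);

    int cur = blocks.empty() ? -1 : 0;
    while (cur >= 0 && cur < static_cast<int>(blocks.size()) && !visited[static_cast<std::size_t>(cur)])
    {
        visited[static_cast<std::size_t>(cur)] = true; // a revisit would mean we followed a back edge
        const ToyBlock& block = blocks[static_cast<std::size_t>(cur)];
        for (std::size_t i = 0; i < block.stmts.size(); ++i)
        {
            const ToyStmt& stmt     = block.stmts[i];
            const bool     firstRef = seenLocals.insert(stmt.lclNum).second;
            if (firstRef && stmt.isZeroInit)
            {
                candidates.emplace_back(cur, static_cast<int>(i));
            }
        }
        cur = block.uniqueSucc; // -1 ends the walk
    }
    return candidates;
}
```

In this model a zero assignment that follows any earlier reference to the same local is never reported, which mirrors the "first reference" requirement in the description.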

@Dotnet-GitSync-Bot Dotnet-GitSync-Bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label May 23, 2020
@erozenfeld
Member Author

erozenfeld commented May 23, 2020

x64 framework pmi diffs:

PMI CodeSize Diffs for System.Private.CoreLib.dll, framework assemblies for  default jit
Summary of Code Size diffs:
(Lower is better)
Total bytes of diff: -51976 (-0.11% of base)
    diff is an improvement.
Top file improvements (bytes):
      -11909 : Microsoft.CodeAnalysis.VisualBasic.dasm (-0.21% of base)
       -6564 : Microsoft.CodeAnalysis.CSharp.dasm (-0.15% of base)
       -6051 : System.Linq.Parallel.dasm (-0.36% of base)
       -4403 : Microsoft.CodeAnalysis.dasm (-0.25% of base)
       -3289 : System.Private.CoreLib.dasm (-0.07% of base)
       -1479 : CommandLine.dasm (-0.33% of base)
       -1332 : System.Private.Xml.dasm (-0.04% of base)
       -1005 : System.Linq.dasm (-0.10% of base)
        -928 : System.Data.Common.dasm (-0.06% of base)
        -843 : System.Text.Json.dasm (-0.11% of base)
        -814 : System.Reflection.Metadata.dasm (-0.19% of base)
        -766 : Microsoft.Diagnostics.Tracing.TraceEvent.dasm (-0.03% of base)
        -721 : System.Collections.Immutable.dasm (-0.07% of base)
        -644 : System.Drawing.Common.dasm (-0.20% of base)
        -637 : System.Security.Cryptography.Algorithms.dasm (-0.19% of base)
        -596 : System.Security.Cryptography.Pkcs.dasm (-0.13% of base)
        -572 : Microsoft.Diagnostics.FastSerialization.dasm (-0.57% of base)
        -491 : System.Memory.dasm (-0.20% of base)
        -443 : System.Data.OleDb.dasm (-0.15% of base)
        -429 : System.Security.Cryptography.Cng.dasm (-0.22% of base)
105 total files with Code Size differences (105 improved, 0 regressed), 161 unchanged.
Top method regressions (bytes):
          14 ( 5.88% of base) : Microsoft.CodeAnalysis.CSharp.dasm - SyntaxFactory:EventFieldDeclaration(VariableDeclarationSyntax):EventFieldDeclarationSyntax
          14 ( 4.22% of base) : Microsoft.CodeAnalysis.CSharp.dasm - SyntaxFactory:DestructorDeclaration(SyntaxToken):DestructorDeclarationSyntax
          14 ( 2.25% of base) : Microsoft.CodeAnalysis.CSharp.dasm - SyntaxRemover:VisitToken(SyntaxToken):SyntaxToken:this
          14 ( 5.93% of base) : Microsoft.CodeAnalysis.VisualBasic.dasm - SyntaxFactory:ModuleStatement(String):ModuleStatementSyntax
          14 ( 5.93% of base) : Microsoft.CodeAnalysis.VisualBasic.dasm - SyntaxFactory:StructureStatement(String):StructureStatementSyntax
          14 ( 5.93% of base) : Microsoft.CodeAnalysis.VisualBasic.dasm - SyntaxFactory:InterfaceStatement(String):InterfaceStatementSyntax
          14 ( 5.93% of base) : Microsoft.CodeAnalysis.VisualBasic.dasm - SyntaxFactory:ClassStatement(String):ClassStatementSyntax
          14 ( 5.93% of base) : Microsoft.CodeAnalysis.VisualBasic.dasm - SyntaxFactory:EnumStatement(String):EnumStatementSyntax
          14 ( 2.25% of base) : Microsoft.CodeAnalysis.VisualBasic.dasm - SyntaxRemover:VisitToken(SyntaxToken):SyntaxToken:this
          13 ( 4.44% of base) : Microsoft.CodeAnalysis.CSharp.dasm - SyntaxFactory:OperatorDeclaration(TypeSyntax,SyntaxToken):OperatorDeclarationSyntax
          13 ( 4.44% of base) : Microsoft.CodeAnalysis.CSharp.dasm - SyntaxFactory:ConversionOperatorDeclaration(SyntaxToken,TypeSyntax):ConversionOperatorDeclarationSyntax
          12 ( 0.66% of base) : Microsoft.CodeAnalysis.CSharp.dasm - LanguageParser:ParseModifiers(SyntaxListBuilder):this
          12 ( 2.09% of base) : Microsoft.CodeAnalysis.VisualBasic.dasm - SyntaxFactory:Identifier(SyntaxTrivia,String,SyntaxTrivia):SyntaxToken
          12 ( 2.01% of base) : Microsoft.CodeAnalysis.VisualBasic.dasm - SyntaxFactory:Token(SyntaxTrivia,ushort,SyntaxTrivia,String):SyntaxToken
          12 ( 1.92% of base) : Microsoft.CodeAnalysis.VisualBasic.dasm - SyntaxFactory:Identifier(SyntaxTrivia,String,bool,String,int,SyntaxTrivia):SyntaxToken
           8 ( 0.52% of base) : Microsoft.CodeAnalysis.dasm - MetadataWriter:SerializeTypeDefTable(BlobBuilder,MetadataSizes):this
           8 ( 0.95% of base) : Microsoft.CodeAnalysis.dasm - MetadataWriter:SerializePropertyTable(BlobBuilder,MetadataSizes):this
           8 ( 0.67% of base) : Microsoft.CodeAnalysis.dasm - MetadataWriter:SerializeExportedTypeTable(BlobBuilder,MetadataSizes):this
           8 ( 0.85% of base) : Microsoft.CodeAnalysis.dasm - MetadataWriter:SerializeGenericParamTable(BlobBuilder,MetadataSizes):this
           8 ( 0.20% of base) : Microsoft.Diagnostics.Tracing.TraceEvent.dasm - WppTraceEventParser:CreateTemplatesForTMFFile(Guid,String):List`1:this
Top method improvements (bytes):
        -565 (-6.86% of base) : Microsoft.Diagnostics.FastSerialization.dasm - GrowableArray`1:Foreach(Func`2):GrowableArray`1:this (49 methods)
        -282 (-7.21% of base) : System.Linq.dasm - ConcatNIterator`1:LazyToArray():ref:this (7 methods)
        -264 (-7.67% of base) : System.Linq.dasm - Concat2Iterator`1:ToArray():ref:this (7 methods)
        -246 (-9.82% of base) : System.Linq.dasm - AppendPrependN`1:LazyToArray():ref:this (7 methods)
        -213 (-3.74% of base) : System.Linq.dasm - SelectManySingleSelectorIterator`2:ToArray():ref:this (7 methods)
        -196 (-22.05% of base) : System.Linq.Parallel.dasm - OrderedParallelQuery`1:.ctor(QueryOperator`1):this (7 methods)
        -196 (-27.18% of base) : System.Linq.Parallel.dasm - ParallelQuery`1:.ctor(QuerySettings):this (7 methods)
        -196 (-11.99% of base) : System.Linq.Parallel.dasm - ConcatQueryOperator`1:Open(QuerySettings,bool):QueryResults`1:this (7 methods)
        -196 (-12.96% of base) : System.Linq.Parallel.dasm - ConcatQueryOperatorResults:.ctor(QueryResults`1,QueryResults`1,ConcatQueryOperator`1,QuerySettings,bool):this (7 methods)
        -196 (-23.53% of base) : System.Linq.Parallel.dasm - QueryOperator`1:.ctor(bool,QuerySettings):this (7 methods)
        -196 (-16.97% of base) : System.Linq.Parallel.dasm - ReverseQueryOperatorResults:.ctor(QueryResults`1,ReverseQueryOperator`1,QuerySettings,bool):this (7 methods)
        -168 (-28.57% of base) : Microsoft.CodeAnalysis.dasm - <>c__DisplayClass38_1`1:<ExecuteSyntaxNodeAction>b__1():this (6 methods)
        -168 (-7.40% of base) : System.Linq.Parallel.dasm - ExceptQueryOperator`1:Open(QuerySettings,bool):QueryResults`1:this (7 methods)
        -168 (-7.40% of base) : System.Linq.Parallel.dasm - IntersectQueryOperator`1:Open(QuerySettings,bool):QueryResults`1:this (7 methods)
        -168 (-7.40% of base) : System.Linq.Parallel.dasm - UnionQueryOperator`1:Open(QuerySettings,bool):QueryResults`1:this (7 methods)
        -168 (-9.30% of base) : System.Linq.Parallel.dasm - AnyAllSearchOperator`1:Open(QuerySettings,bool):QueryResults`1:this (7 methods)
        -168 (-9.30% of base) : System.Linq.Parallel.dasm - ContainsSearchOperator`1:Open(QuerySettings,bool):QueryResults`1:this (7 methods)
        -168 (-9.30% of base) : System.Linq.Parallel.dasm - DefaultIfEmptyQueryOperator`1:Open(QuerySettings,bool):QueryResults`1:this (7 methods)
        -168 (-9.33% of base) : System.Linq.Parallel.dasm - ElementAtQueryOperator`1:Open(QuerySettings,bool):QueryResults`1:this (7 methods)
        -168 (-9.33% of base) : System.Linq.Parallel.dasm - FirstQueryOperator`1:Open(QuerySettings,bool):QueryResults`1:this (7 methods)
Top method regressions (percentages):
          14 ( 5.93% of base) : Microsoft.CodeAnalysis.VisualBasic.dasm - SyntaxFactory:ModuleStatement(String):ModuleStatementSyntax
          14 ( 5.93% of base) : Microsoft.CodeAnalysis.VisualBasic.dasm - SyntaxFactory:StructureStatement(String):StructureStatementSyntax
          14 ( 5.93% of base) : Microsoft.CodeAnalysis.VisualBasic.dasm - SyntaxFactory:InterfaceStatement(String):InterfaceStatementSyntax
          14 ( 5.93% of base) : Microsoft.CodeAnalysis.VisualBasic.dasm - SyntaxFactory:ClassStatement(String):ClassStatementSyntax
          14 ( 5.93% of base) : Microsoft.CodeAnalysis.VisualBasic.dasm - SyntaxFactory:EnumStatement(String):EnumStatementSyntax
          14 ( 5.88% of base) : Microsoft.CodeAnalysis.CSharp.dasm - SyntaxFactory:EventFieldDeclaration(VariableDeclarationSyntax):EventFieldDeclarationSyntax
          13 ( 4.44% of base) : Microsoft.CodeAnalysis.CSharp.dasm - SyntaxFactory:OperatorDeclaration(TypeSyntax,SyntaxToken):OperatorDeclarationSyntax
          13 ( 4.44% of base) : Microsoft.CodeAnalysis.CSharp.dasm - SyntaxFactory:ConversionOperatorDeclaration(SyntaxToken,TypeSyntax):ConversionOperatorDeclarationSyntax
          14 ( 4.22% of base) : Microsoft.CodeAnalysis.CSharp.dasm - SyntaxFactory:DestructorDeclaration(SyntaxToken):DestructorDeclarationSyntax
          14 ( 2.25% of base) : Microsoft.CodeAnalysis.CSharp.dasm - SyntaxRemover:VisitToken(SyntaxToken):SyntaxToken:this
          14 ( 2.25% of base) : Microsoft.CodeAnalysis.VisualBasic.dasm - SyntaxRemover:VisitToken(SyntaxToken):SyntaxToken:this
           7 ( 2.24% of base) : Microsoft.CodeAnalysis.CSharp.dasm - Reader:CreateBlendedNode(CSharpSyntaxNode,SyntaxToken):BlendedNode:this
          12 ( 2.09% of base) : Microsoft.CodeAnalysis.VisualBasic.dasm - SyntaxFactory:Identifier(SyntaxTrivia,String,SyntaxTrivia):SyntaxToken
          12 ( 2.01% of base) : Microsoft.CodeAnalysis.VisualBasic.dasm - SyntaxFactory:Token(SyntaxTrivia,ushort,SyntaxTrivia,String):SyntaxToken
          12 ( 1.92% of base) : Microsoft.CodeAnalysis.VisualBasic.dasm - SyntaxFactory:Identifier(SyntaxTrivia,String,bool,String,int,SyntaxTrivia):SyntaxToken
           6 ( 1.08% of base) : System.Security.Cryptography.Pkcs.dasm - ManagedKeyAgreePal:get_RecipientIdentifier():SubjectIdentifier:this
           7 ( 0.97% of base) : Microsoft.CodeAnalysis.VisualBasic.dasm - Binder:BindUserDefinedUnaryOperator(VisualBasicSyntaxNode,int,BoundExpression,byref,DiagnosticBag):BoundUserDefinedUnaryOperator:this
           8 ( 0.95% of base) : Microsoft.CodeAnalysis.dasm - MetadataWriter:SerializePropertyTable(BlobBuilder,MetadataSizes):this
           7 ( 0.93% of base) : Microsoft.CodeAnalysis.VisualBasic.dasm - Binder:BindUserDefinedNonShortCircuitingBinaryOperator(VisualBasicSyntaxNode,int,BoundExpression,BoundExpression,byref,DiagnosticBag):BoundUserDefinedBinaryOperator:this
           8 ( 0.85% of base) : Microsoft.CodeAnalysis.dasm - MetadataWriter:SerializeGenericParamTable(BlobBuilder,MetadataSizes):this
Top method improvements (percentages):
         -61 (-56.48% of base) : System.Security.Cryptography.Pkcs.dasm - ManagedKeyAgreePal:get_OriginatorIdentifierOrKey():SubjectIdentifierOrKey:this
         -28 (-35.90% of base) : Microsoft.CodeAnalysis.CSharp.dasm - BoundCompoundAssignmentOperator:get_ExpressionSymbol():Symbol:this
         -34 (-35.79% of base) : System.Diagnostics.PerformanceCounter.dasm - InstanceData:get_RawValue():long:this
         -34 (-34.69% of base) : OSExtensions.dasm - ETWKernelControl:IsWin8orNewer():bool
         -37 (-34.26% of base) : Microsoft.CodeAnalysis.CSharp.dasm - InterpolatedStringScanner:ScanInterpolatedStringLiteralNestedVerbatimString():this
         -37 (-33.94% of base) : Microsoft.CodeAnalysis.CSharp.dasm - InterpolatedStringScanner:ScanInterpolatedStringLiteralNestedString(ushort):this
         -22 (-33.33% of base) : System.Data.Common.dasm - SqlString:ToSqlBoolean():SqlBoolean:this
         -22 (-33.33% of base) : System.Data.Common.dasm - SqlString:ToSqlByte():SqlByte:this
         -22 (-33.33% of base) : System.Data.Common.dasm - SqlString:ToSqlInt16():SqlInt16:this
         -22 (-33.33% of base) : System.Data.Common.dasm - SqlString:ToSqlInt32():SqlInt32:this
         -22 (-33.33% of base) : System.Data.Common.dasm - SqlString:ToSqlSingle():SqlSingle:this
         -22 (-33.33% of base) : System.Data.Common.dasm - SqlString:ToSqlGuid():SqlGuid:this
         -68 (-32.85% of base) : System.Data.Common.dasm - SqlString:LessThan(SqlString,SqlString):SqlBoolean
         -22 (-32.84% of base) : Microsoft.CodeAnalysis.dasm - XmlLocation:GetHashCode():int:this
         -22 (-32.84% of base) : System.Security.Cryptography.Pkcs.dasm - ManagedKeyTransPal:get_KeyEncryptionAlgorithm():AlgorithmIdentifier:this
         -40 (-32.79% of base) : Microsoft.CodeAnalysis.dasm - ComMemoryStream:Roslyn.Utilities.IUnsafeComStream.Stat(byref,int):this
         -68 (-32.38% of base) : System.Data.Common.dasm - SqlString:Equals(SqlString,SqlString):SqlBoolean
         -68 (-32.38% of base) : System.Data.Common.dasm - SqlString:GreaterThan(SqlString,SqlString):SqlBoolean
         -68 (-32.38% of base) : System.Data.Common.dasm - SqlString:LessThanOrEqual(SqlString,SqlString):SqlBoolean
         -68 (-32.38% of base) : System.Data.Common.dasm - SqlString:GreaterThanOrEqual(SqlString,SqlString):SqlBoolean
3306 total methods with Code Size differences (3273 improved, 33 regressed), 241569 unchanged.

@erozenfeld
Member Author

x64 benchmarks pmi diffs:

PMI CodeSize Diffs for benchstones and benchmarks game in c:\runtime1\artifacts\tests\coreclr\Windows_NT.x64.Release for  default jit
Summary of Code Size diffs:
(Lower is better)
Total bytes of diff: -340 (-0.07% of base)
    diff is an improvement.
Top file improvements (bytes):
        -121 : BenchmarksGame\pidigits\pidigits-3\pidigits-3.dasm (-2.90% of base)
         -91 : Bytemark\Bytemark\Bytemark.dasm (-0.11% of base)
         -19 : Benchstones\BenchI\LogicArray\LogicArray\LogicArray.dasm (-1.19% of base)
         -12 : Linq\Linq\Linq.dasm (-0.05% of base)
         -11 : Benchstones\BenchF\NewtR\NewtR\NewtR.dasm (-1.10% of base)
         -11 : Benchstones\BenchF\Secant\Secant\Secant.dasm (-0.99% of base)
         -10 : SIMD\ConsoleMandel\ConsoleMandel\ConsoleMandel.dasm (-0.03% of base)
          -8 : BenchmarksGame\reverse-complement\reverse-complement-1\reverse-complement-1.dasm (-0.22% of base)
          -7 : Benchstones\BenchF\MatInv4\MatInv4\MatInv4.dasm (-0.16% of base)
          -7 : Benchstones\BenchI\AddArray2\AddArray2\AddArray2.dasm (-0.38% of base)
          -6 : BenchmarksGame\k-nucleotide\k-nucleotide-9\k-nucleotide-9.dasm (-0.06% of base)
          -6 : Benchstones\BenchF\Bisect\Bisect\Bisect.dasm (-0.53% of base)
          -6 : Benchstones\BenchF\Regula\Regula\Regula.dasm (-0.40% of base)
          -6 : Benchstones\BenchI\8Queens\8Queens\8Queens.dasm (-0.49% of base)
          -5 : Math\Functions\Functions\Functions.dasm (-0.01% of base)
          -5 : SIMD\RayTracer\RayTracer\RayTracer.dasm (-0.02% of base)
          -5 : Benchstones\BenchF\InvMt\InvMt\InvMt.dasm (-0.22% of base)
          -4 : BenchmarksGame\reverse-complement\reverse-complement-6\reverse-complement-6.dasm (-0.08% of base)
18 total files with Code Size differences (18 improved, 0 regressed), 64 unchanged.
Top method improvements (bytes):
         -51 (-4.26% of base) : Bytemark\Bytemark\Bytemark.dasm - EMFloat:InternalFPFToString(byref,byref):int
         -43 (-11.05% of base) : BenchmarksGame\pidigits\pidigits-3\pidigits-3.dasm - pidigits:Run(bool):this
         -39 (-3.49% of base) : BenchmarksGame\pidigits\pidigits-3\pidigits-3.dasm - pidigits:compose_r(int,int,int,int):this
         -34 (-9.58% of base) : Bytemark\Bytemark\Bytemark.dasm - EMFloat:SetupCPUEmFloatArrays(ref,ref,ref,int)
         -26 (-2.32% of base) : BenchmarksGame\pidigits\pidigits-3\pidigits-3.dasm - pidigits:compose_l(int,int,int,int):this
         -19 (-18.27% of base) : Benchstones\BenchI\LogicArray\LogicArray\LogicArray.dasm - LogicArray:Bench():bool
         -13 (-2.43% of base) : BenchmarksGame\pidigits\pidigits-3\pidigits-3.dasm - pidigits:extract(int):int:this
         -12 (-15.79% of base) : Linq\Linq\Linq.dasm - <>c:<Order00Manual>b__34_0(Product,Product):int:this
         -11 (-4.21% of base) : Benchstones\BenchF\NewtR\NewtR\NewtR.dasm - NewtR:Bench():bool
         -11 (-3.65% of base) : Benchstones\BenchF\Secant\Secant\Secant.dasm - Secant:Bench():bool
         -10 (-1.09% of base) : SIMD\ConsoleMandel\ConsoleMandel\ConsoleMandel.dasm - ConsoleMandel:Main(ref):int
          -8 (-10.53% of base) : BenchmarksGame\reverse-complement\reverse-complement-1\reverse-complement-1.dasm - Block:IndexOf(ubyte,int):Index:this
          -7 (-1.10% of base) : Benchstones\BenchF\MatInv4\MatInv4\MatInv4.dasm - MatInv4:Bench():bool
          -7 (-4.64% of base) : Benchstones\BenchI\AddArray2\AddArray2\AddArray2.dasm - AddArray2:Bench(ref):bool
          -6 (-1.91% of base) : Bytemark\Bytemark\Bytemark.dasm - BitOps:Run():double:this
          -6 (-0.49% of base) : BenchmarksGame\k-nucleotide\k-nucleotide-9\k-nucleotide-9.dasm - KNucleotide_9:loadThreeData(Stream)
          -6 (-1.73% of base) : Benchstones\BenchF\Bisect\Bisect\Bisect.dasm - Bisect:Bench():bool
          -6 (-1.51% of base) : Benchstones\BenchF\Regula\Regula\Regula.dasm - Regula:Bench():bool
          -6 (-2.83% of base) : Benchstones\BenchI\8Queens\8Queens\8Queens.dasm - EightQueens:Bench():bool
          -5 (-0.45% of base) : Math\Functions\Functions\Functions.dasm - Program:Main(ref):int
Top method improvements (percentages):
         -19 (-18.27% of base) : Benchstones\BenchI\LogicArray\LogicArray\LogicArray.dasm - LogicArray:Bench():bool
         -12 (-15.79% of base) : Linq\Linq\Linq.dasm - <>c:<Order00Manual>b__34_0(Product,Product):int:this
         -43 (-11.05% of base) : BenchmarksGame\pidigits\pidigits-3\pidigits-3.dasm - pidigits:Run(bool):this
          -8 (-10.53% of base) : BenchmarksGame\reverse-complement\reverse-complement-1\reverse-complement-1.dasm - Block:IndexOf(ubyte,int):Index:this
         -34 (-9.58% of base) : Bytemark\Bytemark\Bytemark.dasm - EMFloat:SetupCPUEmFloatArrays(ref,ref,ref,int)
          -7 (-4.64% of base) : Benchstones\BenchI\AddArray2\AddArray2\AddArray2.dasm - AddArray2:Bench(ref):bool
         -51 (-4.26% of base) : Bytemark\Bytemark\Bytemark.dasm - EMFloat:InternalFPFToString(byref,byref):int
         -11 (-4.21% of base) : Benchstones\BenchF\NewtR\NewtR\NewtR.dasm - NewtR:Bench():bool
          -4 (-4.12% of base) : BenchmarksGame\reverse-complement\reverse-complement-6\reverse-complement-6.dasm - ReverseComplement_6:tryTake(BlockingCollection`1,byref):bool
         -11 (-3.65% of base) : Benchstones\BenchF\Secant\Secant\Secant.dasm - Secant:Bench():bool
         -39 (-3.49% of base) : BenchmarksGame\pidigits\pidigits-3\pidigits-3.dasm - pidigits:compose_r(int,int,int,int):this
          -6 (-2.83% of base) : Benchstones\BenchI\8Queens\8Queens\8Queens.dasm - EightQueens:Bench():bool
         -13 (-2.43% of base) : BenchmarksGame\pidigits\pidigits-3\pidigits-3.dasm - pidigits:extract(int):int:this
         -26 (-2.32% of base) : BenchmarksGame\pidigits\pidigits-3\pidigits-3.dasm - pidigits:compose_l(int,int,int,int):this
          -5 (-2.24% of base) : SIMD\RayTracer\RayTracer\RayTracer.dasm - RayTracer:GetHueShift(int):float:this
          -6 (-1.91% of base) : Bytemark\Bytemark\Bytemark.dasm - BitOps:Run():double:this
          -6 (-1.73% of base) : Benchstones\BenchF\Bisect\Bisect\Bisect.dasm - Bisect:Bench():bool
          -6 (-1.51% of base) : Benchstones\BenchF\Regula\Regula\Regula.dasm - Regula:Bench():bool
          -7 (-1.10% of base) : Benchstones\BenchF\MatInv4\MatInv4\MatInv4.dasm - MatInv4:Bench():bool
         -10 (-1.09% of base) : SIMD\ConsoleMandel\ConsoleMandel\ConsoleMandel.dasm - ConsoleMandel:Main(ref):int
23 total methods with Code Size differences (23 improved, 0 regressed), 1870 unchanged.

@erozenfeld
Member Author

Example where the optimization removed a prolog zero initialization:

Assembly listing for method Kernel32:GetEnvironmentVariable(String,Span`1):int
G_M5523_IG01:
       push     rbp
       sub      rsp, 48
       lea      rbp, [rsp+30H]
-      xor      rax, rax
-      mov      qword ptr [rbp-08H], rax
G_M5523_IG02:
       mov      r8, bword ptr [rdx]
       mov      bword ptr [rbp-08H], r8
       mov      qword ptr [rbp-10H], r8
       mov      r8d, dword ptr [rdx+8]
       mov      rdx, qword ptr [rbp-10H]
       call     Kernel32:GetEnvironmentVariable(String,long,int):int
       nop      					
G_M5523_IG03:
       lea      rsp, [rbp]
       pop      rbp
       ret      
-; Total bytes of code 47, prolog size 16, PerfScore 17.95, (MethodHash=18b8ea6c) for method Kernel32:GetEnvironmentVariable(String,Span`1):int
+; Total bytes of code 41, prolog size 10, PerfScore 16.10, (MethodHash=18b8ea6c) for method Kernel32:GetEnvironmentVariable(String,Span`1):int

@erozenfeld
Member Author

Example where the optimization removed a prolog zero initialization for one variable and an explicit zero initialization for another variable:

; Assembly listing for method RuntimeAssembly:GetCodeBase(bool):String:this
G_M43056_IG01:
       push     rbp
       sub      rsp, 64
       lea      rbp, [rsp+40H]
       xor      rax, rax
       mov      qword ptr [rbp-08H], rax
-      mov      qword ptr [rbp-10H], rax	
G_M43056_IG02:
-      xor      r8, r8
-      mov      gword ptr [rbp-08H], r8
       mov      gword ptr [rbp-10H], rcx
       lea      rcx, [rbp-10H]
       mov      r8, gword ptr [rbp-10H]
       mov      r8, qword ptr [r8+32]
       lea      rax, [rbp-08H]
       lea      r9, bword ptr [rbp-20H]
       mov      qword ptr [r9], rcx
       mov      qword ptr [r9+8], r8
       lea      rcx, bword ptr [rbp-20H]
       movzx    rdx, dl
       mov      r8, rax
       call     RuntimeAssembly:GetCodeBase(QCallAssembly,bool,StringHandleOnStack)
       mov      rax, gword ptr [rbp-08H]
G_M43056_IG03:
       lea      rsp, [rbp]
       pop      rbp
       ret      

-; Total bytes of code 83, prolog size 20, PerfScore 26.05, (MethodHash=dd2357cf) for method RuntimeAssembly:GetCodeBase(bool):String:this
+; Total bytes of code 72, prolog size 16, PerfScore 22.70, (MethodHash=dd2357cf) for method RuntimeAssembly:GetCodeBase(bool):String:this

@erozenfeld
Member Author

erozenfeld commented May 23, 2020

The regressions are all cases where we initialize less memory in the prolog, but doing so takes more static instructions. The prolog zeroes memory in a loop that clears 48 bytes per iteration, so the number of static instructions is smallest when the initialized size is a multiple of 48 bytes:

Assembly listing for method MetadataWriter:SerializePropertyTable(BlobBuilder,MetadataSizes):this
G_M36805_IG01:
       push     r14
       push     rdi
       push     rsi
       push     rbp
       push     rbx
       sub      rsp, 176
       vzeroupper 
       vxorps   xmm4, xmm4
-      mov      rax, -144
-      vmovdqa  xmmword ptr [rsp+rax+B0H], xmm4
-      vmovdqa  xmmword ptr [rsp+rax+C0H], xmm4
-      vmovdqa  xmmword ptr [rsp+rax+D0H], xmm4
+      mov      rax, -96
+      vmovdqa  xmmword ptr [rsp+rax+80H], xmm4
+      vmovdqa  xmmword ptr [rsp+rax+90H], xmm4
+      vmovdqa  xmmword ptr [rsp+rax+A0H], xmm4
       add      rax, 48
       jne      SHORT  -5 instr
+      mov      qword ptr [rsp+80H], rax
       mov      rdi, rcx
       mov      rsi, rdx
       mov      rbx, r8
-						;; bbWeight=1    PerfScore 11.83
+						;; bbWeight=1    PerfScore 12.83
G_M36805_IG02:
       mov      rcx, gword ptr [rdi+496]
       mov      gword ptr [rsp+48H], rcx
       xor      eax, eax
       mov      dword ptr [rsp+50H], eax
       mov      ecx, dword ptr [rcx+20]
       mov      dword ptr [rsp+54H], ecx
       vxorps   xmm0, xmm0
       vmovdqu  xmmword ptr [rsp+58H], xmm0
       mov      qword ptr [rsp+68H], rax
						;; bbWeight=1    PerfScore 9.58
G_M36805_IG03:
       vmovdqu  xmm0, xmmword ptr [rsp+48H]
       vmovdqu  xmmword ptr [rsp+88H], xmm0
       vmovdqu  xmm0, xmmword ptr [rsp+58H]
       vmovdqu  xmmword ptr [rsp+98H], xmm0
       mov      rcx, qword ptr [rsp+68H]
       mov      qword ptr [rsp+A8H], rcx
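
The arithmetic behind this regression can be illustrated with a small model. It is a sketch under stated assumptions: `ZeroingSplit` and `splitZeroing` are hypothetical names, and the 144-byte and 104-byte sizes are read off the listing above (a loop starting at `-144` before the change, and a loop starting at `-96` plus one 8-byte `mov` after it).

```cpp
#include <cassert>

// Bytes zeroed by one iteration of the prolog loop in the listing
// (three 16-byte vmovdqa stores).
constexpr int kBytesPerIteration = 48;

// Hypothetical helper type: how a zeroed region splits into full loop
// iterations and a tail that needs extra instructions after the loop.
struct ZeroingSplit
{
    int iterations; // full 48-byte loop iterations
    int tailBytes;  // leftover bytes zeroed outside the loop
};

constexpr ZeroingSplit splitZeroing(int totalBytes)
{
    return {totalBytes / kBytesPerIteration, totalBytes % kBytesPerIteration};
}

// Before the change: 144 bytes, exactly three iterations and no tail.
static_assert(splitZeroing(144).iterations == 3 && splitZeroing(144).tailBytes == 0, "144 = 3 * 48");
// After the change: 104 bytes, two iterations plus an 8-byte tail store.
static_assert(splitZeroing(104).iterations == 2 && splitZeroing(104).tailBytes == 8, "104 = 2 * 48 + 8");
```

The loop has the same static instructions whether it runs two or three times, so the smaller region ends up one instruction longer because of the tail store.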

@erozenfeld
Member Author

pin-icount of SPC crossgen shows a 0.13% throughput regression. Note that this phase doesn't run with minopts.

@erozenfeld
Member Author

@dotnet/jit-contrib @AndyAyersMS PTAL

@erozenfeld erozenfeld mentioned this pull request May 23, 2020
@benaadams
Member

Looks like an assert is failing? https://helix.dot.net/api/2019-06-17/jobs/51a5759f-188d-469c-b161-07b2bb034fbb/workitems/JIT.Directed/files/console.2520679e.log

JIT\Directed\nullabletypes\castclassvaluetype_ro\castclassvaluetype_ro.cmd [FAIL]

Assert failure(PID 3556 [0x00000de4], Thread: 3452 [0x0d7c]): 
Assertion failed 'OperIs(GT_CNS_INT)' in 'NullableTest9:Run()' during 
  'Redundant zero Inits' (IL size 534)

File: F:\workspace\_work\1\s\src\coreclr\src\jit\gtstructs.h Line: 60
          Image: C:\h\w\A2FB0941\p\CoreRun.exe

BEGIN EXECUTION
       "C:\h\w\A2FB0941\p\corerun.exe" castclassvaluetype_ro.dll 
      --- char? s = Helper.Create(default(char)) ---
      --- char? s = null ---
      --- char u = Helper.Create(default(char)) ---
      --- bool? s = Helper.Create(default(bool)) ---
      --- bool? s = null ---
      --- bool u = Helper.Create(default(bool)) ---
      --- byte? s = Helper.Create(default(byte)) ---
      --- byte? s = null ---
      --- byte u = Helper.Create(default(byte)) ---
      --- sbyte? s = Helper.Create(default(sbyte)) ---
      --- sbyte? s = null ---
      --- sbyte u = Helper.Create(default(sbyte)) ---
      --- short? s = Helper.Create(default(short)) ---
      --- short? s = null ---
      --- short u = Helper.Create(default(short)) ---
      --- ushort? s = Helper.Create(default(ushort)) ---
      --- ushort? s = null ---
      --- ushort u = Helper.Create(default(ushort)) ---
      --- int? s = Helper.Create(default(int)) ---
      --- int? s = null ---
      --- int u = Helper.Create(default(int)) ---
      --- uint? s = Helper.Create(default(uint)) ---
      --- uint? s = null ---
      --- uint u = Helper.Create(default(uint)) ---
      Expected: 100
      Actual: -1073740286
      END EXECUTION - FAILED
      FAILED
      Test Harness Exitcode is : 1

Member

@AndyAyersMS AndyAyersMS left a comment

Looks good overall. Left a few notes for clarification / consideration.

Did you look at handling more general flow patterns?

@@ -4690,6 +4690,8 @@ void Compiler::compCompile(void** methodCodePtr, ULONG* methodCodeSize, JitFlags
DoPhase(this, PHASE_BUILD_SSA, &Compiler::fgSsaBuild);
}

DoPhase(this, PHASE_ZERO_INITS, &Compiler::optRemoveRedundantZeroInits);
Member

Seems like it would be less confusing to run this before we build SSA, since it doesn't really leverage or involve SSA (other than skipping phi defs).

Member Author

This optimization needs to run after liveness so that it can use lvTracked and lvLiveInOutOfHndlr. We run liveness when we build SSA.

{
if ((tree->gtFlags & GTF_CALL) != 0)
{
hasGCSafePoint = true;
Member

Don't we have some calls (GTF_CALL_M_SUPPRESS_GC_TRANSITION) that are not safe points? I suppose it is conservatively correct to assume all calls are GC safepoints though.

Member Author

Sure, I added a check for calls with GTF_CALL_M_SUPPRESS_GC_TRANSITION. No new diffs in framework and benchmarks.
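
A minimal sketch of the safe-point bookkeeping discussed in this thread, using hypothetical stand-ins (`ToyNode`, `sawGCSafePoint`, `canSkipPrologInit`) rather than the JIT's `GenTree` API. It treats a call as a GC safe point unless it suppresses the GC transition, then applies a condition with the same shape as the one quoted in the review diff.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for a tree node; not the JIT's GenTree.
struct ToyNode
{
    bool hasCallFlag;          // models (gtFlags & GTF_CALL) != 0
    bool suppressGCTransition; // models GTF_CALL_M_SUPPRESS_GC_TRANSITION
};

// True if any node seen before the assignment could be a GC safe point.
inline bool sawGCSafePoint(const std::vector<ToyNode>& nodesBeforeAssignment)
{
    for (const ToyNode& node : nodesBeforeAssignment)
    {
        if (node.hasCallFlag && !node.suppressGCTransition)
        {
            return true;
        }
    }
    return false;
}

// Same shape as the condition in the diff: a local without GC pointers always
// qualifies; otherwise the method must not be interruptible, must not have
// reached a safe point, and must not require a P/Invoke frame.
inline bool canSkipPrologInit(bool hasGCPtr, bool interruptible, bool hasGCSafePoint, bool requiresPInvokeFrame)
{
    return !hasGCPtr || (!interruptible && !hasGCSafePoint && !requiresPInvokeFrame);
}
```

Treating every call as a safe point, as the original code did, stays conservatively correct; excluding suppressed-transition calls only widens the set of locals that can qualify.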

(!GetInterruptible() && !hasGCSafePoint && !compMethodRequiresPInvokeFrame()))
{
// The local hasn't been used and won't be reported to the gc between
// the prolog and this explicit zero intialization. Therefore, it doesn't
Member

This doesn't check for zero init, it could be any init, so update the comment?

Member Author

Done.

if (!lclDsc->lvTracked && treeOp->gtOp2->IsIntegralConst() &&
(treeOp->gtOp2->AsIntCon()->IconValue() == 0))
{
bool bbInALoop = (block->bbFlags & BBF_BACKWARD_JUMP) != 0;
Member

BBF_BACKWARD_JUMP may be too conservative? Seems like we should have enough graph analysis state lying around to detect loops more accurately. Might not be worth the trouble.

Member Author

I looked around and found that we call optMarkLoopBlocks, but that only sets the block weights, and it doesn't feel right to use that here. I'll leave adding a basic block flag and using it in this optimization for later. I doubt it will make much of a difference.

// We are guaranteed to have a zero initialization in the prolog and
// the local hasn't been redefined between the prolog and this explicit
// zero initialization so the assignment can be safely removed.
if (tree == stmt->GetRootNode())
Member

For non-root assignments can't we just bash the asg tree to a nop? Or if bashing during walking is too painful, keep track of these trees and bash them later?

Or are these rare enough that it's not worth trying to handle them?

Member Author

I tried that and saw no new diffs in framework and benchmarks.

{
if (!lclDsc->HasGCPtr() ||
(!GetInterruptible() && !hasGCSafePoint && !compMethodRequiresPInvokeFrame()))
{
Member

Can you add to your comment above or below and explain why we're checking compMethodRequiresPInvokeFrame here?

Member Author

Done.
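
For readers following the thread, the condition in the excerpt above can be read as a standalone predicate. The sketch below is only an illustration: the struct `ExplicitInitFacts` and the function `canMarkHasExplicitInit` are hypothetical names, and the four booleans stand in for `lclDsc->HasGCPtr()`, `GetInterruptible()`, `hasGCSafePoint`, and `compMethodRequiresPInvokeFrame()` from the PR. It does not model the later refinements (such as the `lvLiveInOutOfHndlr` check).

```cpp
#include <cassert>

// Hypothetical flattened inputs. In the JIT these would come from
// lclDsc->HasGCPtr(), GetInterruptible(), the walk's hasGCSafePoint flag,
// and compMethodRequiresPInvokeFrame().
struct ExplicitInitFacts
{
    bool hasGCPtr;             // local contains GC pointers
    bool fullyInterruptible;   // method is fully interruptible
    bool sawGCSafePoint;       // a GC-safe point was seen before the assignment
    bool requiresPInvokeFrame; // method needs a PInvoke frame
};

// Mirrors the shape of the condition quoted in the review thread:
// a local with no GC pointers always qualifies; a local with GC pointers
// qualifies only if none of the three blocking conditions holds.
bool canMarkHasExplicitInit(const ExplicitInitFacts& f)
{
    if (!f.hasGCPtr)
    {
        return true;
    }
    return !f.fullyInterruptible && !f.sawGCSafePoint && !f.requiresPInvokeFrame;
}
```

Written this way, it is easy to see that the last three checks only matter for locals that contain GC pointers.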

case GT_LCL_FLD:
case GT_LCL_VAR_ADDR:
case GT_LCL_FLD_ADDR:
{
Member

For non-address exposed locals we only care about seeing defs, not uses?

Member Author

We care about seeing defs when removing explicit zero initializations and we care about seeing uses when marking locals with lvHasExplicitInit. I tried a change where I tracked defs and uses separately and saw no new diffs in framework or benchmarks so will keep it simple.

@erozenfeld erozenfeld force-pushed the ZeroInitDiff branch 4 times, most recently from 9474931 to 997792a May 25, 2020 04:17
@erozenfeld
Member Author

Did you look at handling more general flow patterns?

I didn't try to handle more general flow patterns. I suspect it's not worth it as we'll find few cases and the optimization will be more expensive. I'll experiment with that outside of this PR.

@erozenfeld
Member Author

@AndyAyersMS I pushed a commit that addressed your feedback and also fixed several issues:

  1. We should keep prolog zero initialization if the local is lvLiveInOutOfHndlr and
    some node between the prolog and the explicit assignment may throw. This fixed several test failures but resulted in just one framework diff.

  2. Some variables are never zero initialized in the prolog. This commit
    disables the optimization for them. No new diffs from this change.

  3. Fixed the statement that incremented the ref count: msvc and clang had
    different behavior so many tests on Linux were failing.

Only one change in framework diffs from what I quoted above. Current numbers:

Framework:

Total bytes of diff: -51984 (-0.11% of base)
3305 total methods with Code Size differences (3273 improved, 32 regressed), 241570 unchanged.

Benchmarks:

Total bytes of diff: -340 (-0.07% of base)
23 total methods with Code Size differences (23 improved, 0 regressed), 1870 unchanged.

PTAL

This change adds a phase that iterates over basic blocks starting with the first
basic block until there is no unique basic block successor or until it detects a
loop. It keeps track of local nodes it encounters. When it gets to an assignment
to a local variable or a local field, it checks whether the assignment is the
first reference to the local (or to the parent of the local field), and, if so,
it may do one of two optimizations:
1. If the local is untracked, the rhs of the assignment is 0, and the local is
   guaranteed to be fully initialized in the prolog, the explicit zero
   initialization is removed.
2. If the assignment is to a local (and not a field) and either the local has no
   gc pointers or there are no gc-safe points between the prolog and the
   assignment, it marks the local with lvHasExplicitInit which tells the codegen
   not to insert zero initialization for this local in the prolog.

This addresses one of the examples in dotnet#2325 and 5 examples in dotnet#1007.
We should keep prolog zero initialization if the local is lvLiveInOutOfHndlr and
some node between the prolog and the explicit assignment may throw.

Some variables are never zero initialized in the prolog. This commit
disables the optimization for them.

Fixed the statement that incremented the ref count: msvc and clang had
different behavior so tests on Linux were failing.

Code review feedback:
relaxed the gc-safe point condition to recognize calls with suppressed
gc transitions;
fixed comments.
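
The block walk described in this commit message can be sketched roughly as follows. This is a simplified model, not the JIT code: `Node`, `Block`, and `findRedundantZeroInits` are hypothetical names, a `visited` flag stands in for the PR's loop check (which the review thread says is based on `BBF_BACKWARD_JUMP`), and the real preconditions (untracked local, fully initialized in the prolog, may-throw nodes for `lvLiveInOutOfHndlr` locals) are omitted.

```cpp
#include <cassert>
#include <unordered_map>
#include <vector>

// Hypothetical, simplified stand-ins for JIT data structures.
struct Node
{
    enum Kind { Use, Assign } kind;
    int  lclNum;    // local referenced by this node
    bool rhsIsZero; // only meaningful for Assign
};

struct Block
{
    std::vector<Node> nodes;
    int  uniqueSucc = -1;    // index of the only successor, or -1 if none or several
    bool visited    = false; // crude loop detection for this sketch
};

// Walks from the first block along unique successors and returns the locals
// whose explicit zero assignment is the first reference to that local.
std::vector<int> findRedundantZeroInits(std::vector<Block>& blocks)
{
    std::unordered_map<int, unsigned> refCounts;
    std::vector<int>                  redundant;

    int cur = blocks.empty() ? -1 : 0;
    while (cur != -1 && !blocks[cur].visited)
    {
        Block& block  = blocks[cur];
        block.visited = true;

        for (const Node& node : block.nodes)
        {
            // Count every reference (def or use) to the local.
            unsigned count = ++refCounts[node.lclNum];

            if (node.kind == Node::Assign && node.rhsIsZero && count == 1)
            {
                redundant.push_back(node.lclNum);
            }
        }
        cur = block.uniqueSucc;
    }
    return redundant;
}
```

In this model, a local that is read before it is zeroed is not reported, and the walk stops as soon as it would revisit a block or reaches a block without a unique successor.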
@erozenfeld
Member Author

gcstress runs show the same failures with and without my changes except for WeakReferenceTest in Windows x64 checked 0x3 and 0xc runs and Windows x86 0x3 run. The symptoms are the same as in #36970. #37002 is the proposed fix.

I tried to run this test locally and I see the same behavior with and without my changes. So, I'll ignore this failure.

Member

@AndyAyersMS AndyAyersMS left a comment

Thanks, this version looks good.

@erozenfeld erozenfeld merged commit f6ba9ea into dotnet:master May 26, 2020
Contributor

@CarolEidt CarolEidt left a comment

LGTM - only a couple of minor observations, but no need to change for this PR

@@ -13883,7 +13883,7 @@ void Compiler::impImportBlockCode(BasicBlock* block)
bool bbIsReturn = (block->bbJumpKind == BBJ_RETURN) &&
(!compIsForInlining() || (impInlineInfo->iciBlock->bbJumpKind == BBJ_RETURN));
LclVarDsc* const lclDsc = lvaGetDesc(lclNum);
Contributor

Unrelated to this PR, but this code inexplicably uses both lclDsc and lvaTable[lclNum].

LclVarDsc* const lclDsc = lvaGetDesc(lclNum);
unsigned* pRefCount = refCounts.LookupPointer(lclNum);
assert(pRefCount != nullptr);
if (*pRefCount == 1)
Contributor

Minor suggestion - it might be worth a comment (or perhaps add to the comment below) that *pRefCount can never be null, because the local node on the rhs of the assignment must have already been seen.

Member Author

I will add a comment with my next PR.

erozenfeld added a commit to erozenfeld/runtime that referenced this pull request Jun 8, 2020
This is a follow-up to dotnet#36918. It addresses one of the examples in dotnet#1007
where we remove a struct zero initialization but fail to clean up a dead
field assignment.

The change is not to mark a dependently promoted field as untracked
if we know that the struct local is no longer referenced.

I also addressed a couple of late cosmetic review comments from dotnet#36918.

No diffs in framework and benchmarks.
erozenfeld added a commit that referenced this pull request Jun 9, 2020
@ghost ghost locked as resolved and limited conversation to collaborators Dec 9, 2020
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
5 participants