IL Generation: struct and struct records methods are slower #5136

thinkbeforecoding · 2018-06-07T20:30:04Z

Doing a diff between structs and struct records generated code, I noticed that an extra copy is done in methods using copy and update.

Repro steps

Create a struct type
Create a struct record
implement the same method that copy and update the type
compare IL both

[<Struct>]
type Structure = 
    val Line: int
    val OriginalLine: int
    val StartOfLineAbsoluteOffset: int
    new(l,ol,s) = { Line =l; OriginalLine = ol; StartOfLineAbsoluteOffset = s }
    member x.NextLine =
        Structure(x.Line+1, x.OriginalLine+1,x.StartOfLineAbsoluteOffset)
[<Struct>]
type Rec = {
    Line: int
    OriginalLine: int
    StartOfLineAbsoluteOffset: int
} with
    member x.NextLine =
        { x with Line = x.Line + 1; OriginalLine = x.OriginalLine+1}

Expected behavior

IL should be the same

Actual behavior

There is a few extra instructions for struct records:

  .maxstack  5
  .locals init (valuetype Program/Rec V_0)
  IL_0000:  ldarg.0
  IL_0001:  ldobj      Program/Rec
  IL_0006:  stloc.0

the rest is exactly similar.

Known workarounds

Use struct types.

Related information

Is this needed ? The Structure.NextLine seems ok without it.

These lines seems generated by a call to mkAddrGet function (TastOps.fs) for byref value types.

The text was updated successfully, but these errors were encountered:

thinkbeforecoding · 2018-06-08T09:26:13Z

I changed the benchmark and I get something even more disturbing:

BenchmarkDotNet=v0.10.14, OS=Windows 10.0.17134
Intel Core i7-4650U CPU 1.70GHz (Haswell), 1 CPU, 4 logical and 2 physical cores
.NET Core SDK=2.1.300
  [Host] : .NET Core 2.1.0 (CoreCLR 4.6.26515.07, CoreFX 4.6.26515.06), 64bit RyuJIT DEBUG
  Core   : .NET Core 2.1.0 (CoreCLR 4.6.26515.07, CoreFX 4.6.26515.06), 64bit RyuJIT

Job=Core  Runtime=Core

|   Method |      Mean |     Error |    StdDev |    Median | Scaled | ScaledSD |
|--------- |----------:|----------:|----------:|----------:|-------:|---------:|
|   Struct | 291.24 us | 1.7226 us | 1.6113 us | 291.22 us |   6.51 |     0.32 |
| Struct_f |  43.64 us | 0.8384 us | 0.8610 us |  43.63 us |   0.98 |     0.05 |
|    Union |  43.56 us | 0.4553 us | 0.4036 us |  43.48 us |   0.97 |     0.05 |
|  Union_f |  43.42 us | 0.4898 us | 0.3824 us |  43.49 us |   0.97 |     0.05 |
|      Rec | 289.47 us | 2.5421 us | 2.2535 us | 289.27 us |   6.47 |     0.32 |
|    Rec_f |  43.79 us | 0.8497 us | 0.9444 us |  43.33 us |   0.98 |     0.05 |
|      Int |  44.84 us | 0.9546 us | 2.4470 us |  43.96 us |   1.00 |     0.00 |

For all these tests, I'm using a struct that contains a single int, and an operation next that build a new struct containing an incremented value.

The Int tests is directly operating on an int value for reference and is used as a baseline.

The tests with suffix _f are using a function, while those without the suffix use a member property x.Next with the same implementation.

The disturbing thing is that for struct and struct records, the property version is 6.5 slower that others.

thinkbeforecoding · 2018-06-08T09:27:58Z

Here is the code for the different structs:

[<Struct>]
type SingleStruct =
    val value: int
    new(value) = { value = value }
    member x.Next = SingleStruct(x.value + 1)

let snext (x:SingleStruct) = SingleStruct(x.value + 1)

[<Struct>]
type SingleCase = SingleCase of int
    with
    member x.Next = let (SingleCase v) = x in SingleCase(v+1)

let scnext (SingleCase v) = SingleCase(v+1)

[<Struct>]
type SingleRec = { value : int }
    with
    member x.Next = { value = x.value + 1 }

let rnext x = { value = x.value + 1 }

thinkbeforecoding · 2018-06-08T09:28:30Z

And the benchmark:

[<CoreJob>]
type Benchm2() =
    [<Benchmark>]
    member __.Struct() =
        let mutable s = SingleStruct(0)
        for _ in 1 .. 100000 do
            s <- s.Next
    [<Benchmark>]
    member __.Struct_f() =
        let mutable s = SingleStruct(0)
        for _ in 1 .. 100000 do
            s <- snext s
    [<Benchmark>]
    member __.Union() =
        let mutable s = SingleCase 0
        for _ in 1 .. 100000 do
            s <- s.Next
    [<Benchmark>]
    member __.Union_f() =
        let mutable s = SingleCase 0
        for _ in 1 .. 100000 do
            s <- scnext s 
    [<Benchmark>]
    member __.Rec() =
        let mutable s = { value = 0 }
        for _ in 1 .. 100000 do
            s <- s.Next
    [<Benchmark>]
    member __.Rec_f() =
        let mutable s = { value = 0 }
        for _ in 1 .. 100000 do
            s <- rnext s
    [<Benchmark(Baseline=true)>]
    member __.Int() =
        let mutable s = 0
        for _ in 1 .. 100000 do
            s <- s + 1

thinkbeforecoding · 2018-06-08T09:39:58Z

Several interesting point looking at the IL:

SingleCase.Next IL is bigger (24) than SingleRec.Next and SingleStruct (14), so it is not inlined by the F# compiler in the loop. Then the JIT is doing a better job at inlining everything.
In property and function tests, SingleRect operation gets inlined but in a slightly different way. One is the JITted and inlined, the other is not. Same thiing for SingleStruct.

Here is the inlined part of the loop for StringStruct when using the .Next property :

  // using SingleRec.Next

  IL_000b:  ldloca.s   V_0
  IL_000d:  stloc.2
  IL_000e:  ldloc.2
  IL_000f:  ldfld      int32 Program/SingleRec::value@
  IL_0014:  ldc.i4.1
  IL_0015:  add
  IL_0016:  newobj     instance void Program/SingleRec::.ctor(int32)

and the version when using the function:

  // using rnext

  IL_000b:  ldloc.0
  IL_000c:  stloc.2
  IL_000d:  ldloca.s   V_2
  IL_000f:  ldfld      int32 Program/SingleRec::value@
  IL_0014:  ldc.i4.1
  IL_0015:  add
  IL_0016:  newobj     instance void Program/SingleRec::.ctor(int32)

Same number of instructions, but with a slight different layout for the 3 first instructions. The second one get simplified by the JIT to code equivalent to using directly an Int32, the first one takes 6.5 times more time.

zpodlovics · 2018-06-08T14:46:10Z

@thinkbeforecoding You pass the struct by value to the functions and I guess it passed by value to the extension methods too.

Fun fact: you can have near native int speed with ref structs (at least with the C# implementation right now). Unfortunately NET Core 2.1 F# compiler does not have Span support yet (ref struct shipped with this) but you can try it out with .NET (afaik it shipped with VS 15.8): #4888

using System;
using System.Runtime.CompilerServices;

namespace CSharpStruct
{
    public ref struct CSharpStruct    
    {
        private int _value;
        public CSharpStruct(int value) 
        {
            _value = value;
        }

        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        public CSharpStruct Next() 
        {
            return new CSharpStruct(_value+1);
        }

    }
}


    Method |      Mean |     Error |    StdDev | Scaled | ScaledSD |
---------- |----------:|----------:|----------:|-------:|---------:|
    Struct | 257.85 us | 1.1275 us | 0.9995 us |   5.09 |     0.02 |
  Struct_f |  73.89 us | 1.0780 us | 1.0084 us |   1.46 |     0.02 |
     Union |  75.53 us | 1.5103 us | 1.4128 us |   1.49 |     0.03 |
   Union_f |  75.73 us | 1.3336 us | 1.2474 us |   1.50 |     0.02 |
       Rec | 260.04 us | 0.8637 us | 0.8079 us |   5.14 |     0.02 |
     Rec_f |  75.71 us | 1.1767 us | 1.1007 us |   1.50 |     0.02 |
 RefStruct |  50.46 us | 0.6117 us | 0.5722 us |   1.00 |     0.01 |
       Int |  50.63 us | 0.1654 us | 0.1547 us |   1.00 |     0.00 |

Example project - based on your code - are attached:

visualfsharp_5136.zip

thinkbeforecoding · 2018-06-08T14:53:39Z

Strange that you have 1.5x time for Struct Union where I have something closer to 1.0x ??
Where does the difference come from ?

zpodlovics · 2018-06-08T14:58:59Z

@thinkbeforecoding I guess it's due the different method alignment and platform ABI (and optimization levels / bugs) for passing structs: https://github.com/dotnet/coreclr/issues/16619 and https://github.com/dotnet/coreclr/issues/6264

My benchmark was ran on Ubuntu 16.04 x86_64 & .NET Core 2.1.

zpodlovics · 2018-06-08T15:12:17Z

Another run, with some small modification - passing byrefs to snext and rnext functions will makes it lot slower. At least the ref struct performance is consistent.

        Method |      Mean |     Error |    StdDev | Scaled | ScaledSD |
-------------- |----------:|----------:|----------:|-------:|---------:|
        Struct | 268.05 us | 4.4208 us | 4.1352 us |   5.14 |     0.09 |
 Struct_fByRef | 264.17 us | 3.1749 us | 2.8145 us |   5.06 |     0.07 |
         Union |  50.75 us | 0.1988 us | 0.1763 us |   0.97 |     0.01 |
       Union_f |  51.75 us | 0.3163 us | 0.2958 us |   0.99 |     0.01 |
           Rec | 260.26 us | 1.1121 us | 0.9858 us |   4.99 |     0.05 |
    Rec_fByRef | 261.79 us | 1.8999 us | 1.7772 us |   5.02 |     0.06 |
     RefStruct |  51.60 us | 0.2482 us | 0.2200 us |   0.99 |     0.01 |
           Int |  52.20 us | 0.5446 us | 0.5094 us |   1.00 |     0.00 |

dsyme · 2018-06-08T18:58:30Z

I haven't followed the whole discussion but I looked at the IL difference between Rec and Rec_f and the differences seem sooooo tiny:

Is it just some random alignment problem or is there a genuine issue here?

dsyme · 2018-06-08T19:11:44Z

This feels related to #2688.

Basically I think

match x with StructUnion v -> E

and

{ x with c = E }

should both respect the byrefness of x in the case that x is a byref pointer. At the moment these become:

let matchValue = implicit-deref(x) in let v = matchValue.Item in E

and

let recdValue = implicit-deref(x) in { a=recdValue.a; b=recdValue.b; c=E }

They should become:

let v = x.Item in E

and

{ a=x.a; b=x.b; c=E }

dsyme · 2018-06-08T19:29:10Z

@abelbraaksma Ah no, I don't think that's the cause of the difference between Rec and Rec_f after all. It's still a good thing to fix however

zpodlovics · 2018-06-08T19:36:06Z

@dsyme The performance overhead seems to be a legitimate issue here. It also seems that a small difference in the IL could make a huge difference in performance. Passing a struct by reference with 5x performance penality seems just too high to me.

Update: It seems that Rec and Rec_fByRef have exactly the same generated bytecode.

BenchmarkDotNet=v0.10.14, OS=ubuntu 16.04
AMD A10-7850K Radeon R7, 12 Compute Cores 4C+8G, 1 CPU, 4 logical and 2 physical cores
.NET Core SDK=2.1.300
  [Host]     : .NET Core ? (CoreCLR 4.6.26515.07, CoreFX 4.6.26515.06), 64bit RyuJIT DEBUG
  DefaultJob : .NET Core 2.1.0 (CoreCLR 4.6.26515.07, CoreFX 4.6.26515.06), 64bit RyuJIT


        Method |      Mean |     Error |    StdDev | Scaled | ScaledSD |
-------------- |----------:|----------:|----------:|-------:|---------:|
        Struct | 262.62 us | 0.6720 us | 0.5957 us |   5.17 |     0.04 |
      Struct_f |  75.30 us | 0.8854 us | 0.8282 us |   1.48 |     0.02 |
 Struct_fByRef | 267.37 us | 2.6758 us | 2.5029 us |   5.26 |     0.06 |
         Union |  51.75 us | 0.2702 us | 0.2527 us |   1.02 |     0.01 |
       Union_f |  51.90 us | 0.2566 us | 0.2400 us |   1.02 |     0.01 |
  Union_fByRef |  74.66 us | 0.9415 us | 0.8807 us |   1.47 |     0.02 |
           Rec | 262.06 us | 1.7891 us | 1.6736 us |   5.16 |     0.05 |
         Rec_f |  73.72 us | 0.6431 us | 0.5701 us |   1.45 |     0.02 |
    Rec_fByRef | 262.53 us | 0.7352 us | 0.6517 us |   5.17 |     0.04 |
     RefStruct |  50.71 us | 0.3437 us | 0.3047 us |   1.00 |     0.01 |
           Int |  50.80 us | 0.4300 us | 0.4022 us |   1.00 |     0.00 |

// * Hints *
Outliers
  Benchm2.Struct: Default     -> 1 outlier  was  removed
  Benchm2.Rec_f: Default      -> 1 outlier  was  removed
  Benchm2.Rec_fByRef: Default -> 1 outlier  was  removed
  Benchm2.RefStruct: Default  -> 1 outlier  was  removed

Rec:

.method public hidebysig instance void 
            Rec() cil managed
    {
      .custom instance void [BenchmarkDotNet.Core]BenchmarkDotNet.Attributes.BenchmarkAttribute::.ctor() = ( 01 00 00 00 ) 
      // Code size       41 (0x29)
      .maxstack  4
      .locals init (valuetype Program/SingleRec V_0,
               int32 V_1,
               valuetype Program/SingleRec& V_2)
      IL_0000:  ldc.i4.0
      IL_0001:  newobj     instance void Program/SingleRec::.ctor(int32)
      IL_0006:  stloc.0
      IL_0007:  ldc.i4.1
      IL_0008:  stloc.1
      IL_0009:  br.s       IL_0020

      IL_000b:  ldloca.s   V_0
      IL_000d:  stloc.2
      IL_000e:  ldloc.2
      IL_000f:  ldfld      int32 Program/SingleRec::value@
      IL_0014:  ldc.i4.1
      IL_0015:  add
      IL_0016:  newobj     instance void Program/SingleRec::.ctor(int32)
      IL_001b:  stloc.0
      IL_001c:  ldloc.1
      IL_001d:  ldc.i4.1
      IL_001e:  add
      IL_001f:  stloc.1
      IL_0020:  ldloc.1
      IL_0021:  ldc.i4     0x186a1
      IL_0026:  blt.s      IL_000b

      IL_0028:  ret
    } // end of method Benchm2::Rec

Rec_f:

.method public hidebysig instance void 
            Rec_fByRef() cil managed
    {
      .custom instance void [BenchmarkDotNet.Core]BenchmarkDotNet.Attributes.BenchmarkAttribute::.ctor() = ( 01 00 00 00 ) 
      // Code size       41 (0x29)
      .maxstack  4
      .locals init (valuetype Program/SingleRec V_0,
               int32 V_1,
               valuetype Program/SingleRec& V_2)
      IL_0000:  ldc.i4.0
      IL_0001:  newobj     instance void Program/SingleRec::.ctor(int32)
      IL_0006:  stloc.0
      IL_0007:  ldc.i4.1
      IL_0008:  stloc.1
      IL_0009:  br.s       IL_0020

      IL_000b:  ldloca.s   V_0
      IL_000d:  stloc.2
      IL_000e:  ldloc.2
      IL_000f:  ldfld      int32 Program/SingleRec::value@
      IL_0014:  ldc.i4.1
      IL_0015:  add
      IL_0016:  newobj     instance void Program/SingleRec::.ctor(int32)
      IL_001b:  stloc.0
      IL_001c:  ldloc.1
      IL_001d:  ldc.i4.1
      IL_001e:  add
      IL_001f:  stloc.1
      IL_0020:  ldloc.1
      IL_0021:  ldc.i4     0x186a1
      IL_0026:  blt.s      IL_000b

      IL_0028:  ret
    } // end of method Benchm2::Rec_fByRef

Rec_fByRef:

 .method public hidebysig instance void 
            Rec_fByRef() cil managed
    {
      .custom instance void [BenchmarkDotNet.Core]BenchmarkDotNet.Attributes.BenchmarkAttribute::.ctor() = ( 01 00 00 00 ) 
      // Code size       41 (0x29)
      .maxstack  4
      .locals init (valuetype Program/SingleRec V_0,
               int32 V_1,
               valuetype Program/SingleRec& V_2)
      IL_0000:  ldc.i4.0
      IL_0001:  newobj     instance void Program/SingleRec::.ctor(int32)
      IL_0006:  stloc.0
      IL_0007:  ldc.i4.1
      IL_0008:  stloc.1
      IL_0009:  br.s       IL_0020

      IL_000b:  ldloca.s   V_0
      IL_000d:  stloc.2
      IL_000e:  ldloc.2
      IL_000f:  ldfld      int32 Program/SingleRec::value@
      IL_0014:  ldc.i4.1
      IL_0015:  add
      IL_0016:  newobj     instance void Program/SingleRec::.ctor(int32)
      IL_001b:  stloc.0
      IL_001c:  ldloc.1
      IL_001d:  ldc.i4.1
      IL_001e:  add
      IL_001f:  stloc.1
      IL_0020:  ldloc.1
      IL_0021:  ldc.i4     0x186a1
      IL_0026:  blt.s      IL_000b

      IL_0028:  ret
    } // end of method Benchm2::Rec_fByRef

The updated benchmark is here:

visualfsharp_5136_2.zip

``

The performance fluctuation seems too large (~50%) for method and loop alignment, but I did not yet investigated it.

zpodlovics · 2018-06-08T19:46:00Z

Well it seems that #2688 is really similar or the same to the issue here: fsharp/fslang-design#287 (comment) and the response here: fsharp/fslang-design#287 (comment)

"Basically we need to rejig the pattern match compiler in F# so it can accept an LValue as its starting point rather than RValue. It's fairly conceptually straightforward but needs care."

dsyme · 2018-06-08T19:53:26Z

@zpodlovics Yes, it's the same issue

dsyme · 2018-06-08T20:08:25Z

https://github.com/Microsoft/visualfsharp/compare/master...dsyme:fix-5136?expand=1 is a quick draft of a possible fix for

{ x with ... } respecting byrefs
&x is a constant value even if x is mutable and can be propagated (so let y = &x in ... y ... the occurrences of y get replaced by &x and y gets removed)

This actually fixes the perf problems in this thread (I think the perf improvements are from the second though in other cases the first will help a lot as well)

   Method |     Mean |     Error |    StdDev | Scaled | ScaledSD |
--------- |---------:|----------:|----------:|-------:|---------:|
   Struct | 29.14 us | 0.5042 us | 0.4716 us |   0.98 |     0.03 |
 Struct_f | 29.69 us | 0.5893 us | 0.7866 us |   1.00 |     0.04 |
    Union | 29.26 us | 0.3893 us | 0.3451 us |   0.98 |     0.03 |
  Union_f | 29.27 us | 0.6739 us | 0.6921 us |   0.98 |     0.03 |
      Rec | 29.12 us | 0.3752 us | 0.3510 us |   0.98 |     0.03 |
    Rec_f | 29.57 us | 0.5717 us | 0.7230 us |   0.99 |     0.04 |
      Int | 29.77 us | 0.5904 us | 0.8081 us |   1.00 |     0.00 |

thinkbeforecoding · 2018-06-08T21:23:16Z

Awsome !
I still have a lot to learn about compiler internals to be able to fix this kind of things 😁

abelbraaksma · 2018-06-09T11:32:07Z

@abelbraaksma Ah no, I don't think that's the cause of the difference between Rec and Rec_f after all. It's still a good thing to fix however

You probably meant @zpodlovics? Though I'm honored ;)

cartermp · 2018-09-20T04:31:03Z

Closing as #5148 is merged.

thinkbeforecoding changed the title ~~IL Generation: struct record extra copy~~ IL Generation: struct and struct records methods are slower Jun 8, 2018

dsyme added Tenet-Performance Feature Improvement labels Jun 8, 2018

dsyme mentioned this issue Jun 8, 2018

[CompilerPerf] Fix 5136: LAddrOf is a constant and {x with ...} respects byrefness of x #5148

Merged

dsyme added the Ready label Jun 11, 2018

cartermp added this to the Unknown milestone Aug 25, 2018

cartermp modified the milestones: Unknown, 16.0 Sep 20, 2018

cartermp closed this as completed Sep 20, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

IL Generation: struct and struct records methods are slower #5136

IL Generation: struct and struct records methods are slower #5136

thinkbeforecoding commented Jun 7, 2018 •

edited

Loading

thinkbeforecoding commented Jun 8, 2018 •

edited

Loading

thinkbeforecoding commented Jun 8, 2018

thinkbeforecoding commented Jun 8, 2018

thinkbeforecoding commented Jun 8, 2018

zpodlovics commented Jun 8, 2018 •

edited

Loading

thinkbeforecoding commented Jun 8, 2018 •

edited

Loading

zpodlovics commented Jun 8, 2018 •

edited

Loading

zpodlovics commented Jun 8, 2018

dsyme commented Jun 8, 2018

dsyme commented Jun 8, 2018 •

edited

Loading

dsyme commented Jun 8, 2018

zpodlovics commented Jun 8, 2018 •

edited

Loading

zpodlovics commented Jun 8, 2018 •

edited

Loading

dsyme commented Jun 8, 2018

dsyme commented Jun 8, 2018 •

edited

Loading

thinkbeforecoding commented Jun 8, 2018

abelbraaksma commented Jun 9, 2018

cartermp commented Sep 20, 2018

IL Generation: struct and struct records methods are slower #5136

IL Generation: struct and struct records methods are slower #5136

Comments

thinkbeforecoding commented Jun 7, 2018 • edited Loading

Repro steps

Expected behavior

Actual behavior

Known workarounds

Related information

thinkbeforecoding commented Jun 8, 2018 • edited Loading

thinkbeforecoding commented Jun 8, 2018

thinkbeforecoding commented Jun 8, 2018

thinkbeforecoding commented Jun 8, 2018

zpodlovics commented Jun 8, 2018 • edited Loading

thinkbeforecoding commented Jun 8, 2018 • edited Loading

zpodlovics commented Jun 8, 2018 • edited Loading

zpodlovics commented Jun 8, 2018

dsyme commented Jun 8, 2018

dsyme commented Jun 8, 2018 • edited Loading

dsyme commented Jun 8, 2018

zpodlovics commented Jun 8, 2018 • edited Loading

zpodlovics commented Jun 8, 2018 • edited Loading

dsyme commented Jun 8, 2018

dsyme commented Jun 8, 2018 • edited Loading

thinkbeforecoding commented Jun 8, 2018

abelbraaksma commented Jun 9, 2018

cartermp commented Sep 20, 2018

thinkbeforecoding commented Jun 7, 2018 •

edited

Loading

thinkbeforecoding commented Jun 8, 2018 •

edited

Loading

zpodlovics commented Jun 8, 2018 •

edited

Loading

thinkbeforecoding commented Jun 8, 2018 •

edited

Loading

zpodlovics commented Jun 8, 2018 •

edited

Loading

dsyme commented Jun 8, 2018 •

edited

Loading

zpodlovics commented Jun 8, 2018 •

edited

Loading

zpodlovics commented Jun 8, 2018 •

edited

Loading

dsyme commented Jun 8, 2018 •

edited

Loading