EgorBot for AndyAyersMS in #109209 #137

EgorBot · 2024-10-26T15:36:28Z

Processing dotnet/runtime#109209 (comment) command:

Command

-intel -arm64 -profiler --envvars DOTNET_JitDisasm:TestInner

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Collections.Generic;
using System.Runtime.CompilerServices;

BenchmarkSwitcher.FromAssembly(typeof(Bench).Assembly).Run(args);

public class Bench
{
    static string[] Data = new string[512];

    [Benchmark]
    public int Test() => TestInner(Data);

    [MethodImpl(MethodImplOptions.NoInlining)]
    int TestInner(ICollection<string> c) => c.Count;
}

(EgorBot will reply in this issue)

EgorBot · 2024-10-26T15:59:09Z

Benchmark results on `Arm64`

BenchmarkDotNet v0.14.0, Ubuntu 24.04 LTS (Noble Numbat)
Arm64
  Job-MELZVS : .NET 10.0.0 (42.42.42.42424), Arm64 RyuJIT AdvSIMD
  Job-YBPLAN : .NET 10.0.0 (42.42.42.42424), Arm64 RyuJIT AdvSIMD
EnvironmentVariables=DOTNET_JitDisasm=TestInner

Method	Toolchain	Mean	Error	Ratio
Test	Main	1.025 ns	0.0006 ns	1.00
Test	PR	1.082 ns	0.0007 ns	1.06

BDN_Artifacts.zip

Profile for `Bench_Test`:

Flame graphs: Main vs PR 🔥
Speedscope: Main vs PR
Hot asm: Main vs PR
Hot functions: Main vs PR
Counters: Main vs PR

EgorBot · 2024-10-26T15:59:10Z

cc @AndyAyersMS (logs)

EgorBot · 2024-10-26T16:02:01Z

Benchmark results on `Intel`

BenchmarkDotNet v0.14.0, Ubuntu 24.04 LTS (Noble Numbat)
Intel Xeon Platinum 8488C, 1 CPU, 8 logical and 4 physical cores
  Job-FXRMOH : .NET 10.0.0 (42.42.42.42424), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
  Job-TQFOQG : .NET 10.0.0 (42.42.42.42424), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
EnvironmentVariables=DOTNET_JitDisasm=TestInner

Method	Toolchain	Mean	Error	Ratio
Test	Main	4.0075 ns	0.0021 ns	1.00
Test	PR	0.6158 ns	0.0013 ns	0.15

BDN_Artifacts.zip

Profile for `Bench_Test`:

Flame graphs: Main vs PR 🔥
Speedscope: Main vs PR
Hot asm: Main vs PR
Hot functions: Main vs PR
Counters: Main vs PR
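
As a rough cross-check of the Ratio columns in the Arm64 and Intel tables, the sketch below recomputes them from the reported means. It assumes Ratio is approximately the PR mean divided by the Main mean, rounded to two decimals; `RatioCheck` and `Ratio` are hypothetical names used only for illustration.

```csharp
using System;

public static class RatioCheck
{
    // Ratio of the PR mean to the Main (baseline) mean, rounded to two decimals.
    // Assumption: this approximates how the Ratio column is derived.
    public static double Ratio(double baselineMeanNs, double prMeanNs) =>
        Math.Round(prMeanNs / baselineMeanNs, 2);
}
```

With the means above, `RatioCheck.Ratio(1.025, 1.082)` gives 1.06 for Arm64 and `RatioCheck.Ratio(4.0075, 0.6158)` gives 0.15 for Intel, so the PR is roughly 6.5x faster on Intel and about 6% slower on Arm64.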

EgorBot · 2024-10-26T16:02:02Z

cc @AndyAyersMS (logs)

EgorBo · 2024-10-26T16:02:56Z

@AndyAyersMS not sure why perf is the same on arm64; perhaps the GDV check is expensive on arm64? The JitDisasm.asm output (see BDN_Artifacts.zip) does show that the PR has different codegen:

Main:

; Assembly listing for method Bench:TestInner(System.Collections.Generic.ICollection`1[System.String]):int:this (Tier1)
; Emitting BLENDED_CODE for generic ARM64 - Unix
; Tier1 code
; optimized code
; fp based frame
; partially interruptible

G_M000_IG01:                ;; offset=0x0000
            stp     fp, lr, [sp, #-0x10]!
            mov     fp, sp
 
G_M000_IG02:                ;; offset=0x0008
            mov     x0, x1
            movz    x11, #0x5B0
; ............................... 32B boundary ...............................
            movk    x11, #0xB805 LSL #16
            movk    x11, #0xFAC8 LSL #32
            ldr     xip0, [x11]
            blr     xip0
 
G_M000_IG03:                ;; offset=0x0020
            ldp     fp, lr, [sp], #0x10
            ret     lr
 
; Total bytes of code 40

PR:

; Assembly listing for method Bench:TestInner(System.Collections.Generic.ICollection`1[System.String]):int:this (Tier1)
; Emitting BLENDED_CODE for generic ARM64 - Unix
; Tier1 code
; optimized code
; fp based frame
; partially interruptible
; 0 inlinees with PGO data; 1 single block inlinees; 0 inlinees without PGO data

G_M000_IG01:                ;; offset=0x0000
            stp     fp, lr, [sp, #-0x10]!
            mov     fp, sp
 
G_M000_IG02:                ;; offset=0x0008
            ldr     x0, [x1]
            movz    x11, #0x9088
; ............................... 32B boundary ...............................
            movk    x11, #0x1D1C LSL #16
            movk    x11, #0xE000 LSL #32
            cmp     x0, x11
            bne     G_M000_IG05
 
G_M000_IG03:                ;; offset=0x0020
            ldr     w0, [x1, #0x08]
 
G_M000_IG04:                ;; offset=0x0024
            ldp     fp, lr, [sp], #0x10
            ret     lr
 
G_M000_IG05:                ;; offset=0x002C
            mov     x0, x1
; ............................... 32B boundary ...............................
            movz    x11, #0x5B0
            movk    x11, #0x1C02 LSL #16
            movk    x11, #0xE000 LSL #32
            ldr     xip0, [x11]
            blr     xip0
            b       G_M000_IG04
 
; Total bytes of code 72
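
To make the `movz`/`movk` sequence in the PR listing easier to follow, here is a minimal sketch of how those three instructions assemble the 48-bit value that the loaded word at `[x1]` is compared against. `Arm64Immediate`, `Movz`, `Movk`, and `GuardConstant` are hypothetical helpers that only model the bit manipulation. The value itself is specific to this run and is presumably the address identifying the guessed type.

```csharp
public static class Arm64Immediate
{
    // movz: clear the register, then place a 16-bit chunk at the given shift.
    public static ulong Movz(ushort imm16, int shift) => (ulong)imm16 << shift;

    // movk: overwrite one 16-bit lane and keep all other bits of the register.
    public static ulong Movk(ulong reg, ushort imm16, int shift) =>
        (reg & ~(0xFFFFUL << shift)) | ((ulong)imm16 << shift);

    // Models the three-instruction sequence that precedes `cmp x0, x11`.
    public static ulong GuardConstant()
    {
        ulong x11 = Movz(0x9088, 0);   // movz x11, #0x9088
        x11 = Movk(x11, 0x1D1C, 16);   // movk x11, #0x1D1C LSL #16
        x11 = Movk(x11, 0xE000, 32);   // movk x11, #0xE000 LSL #32
        return x11;                    // 0xE0001D1C9088
    }
}
```

The guard therefore costs three instructions to form the constant plus a load, a compare, and a branch. The x64 listing does the same comparison with a single 64-bit immediate `mov` followed by `cmp` against memory.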

EgorBo · 2024-10-26T16:07:03Z

x64:

Main:

; Assembly listing for method Bench:TestInner(System.Collections.Generic.ICollection`1[System.String]):int:this (Tier1)
; Emitting BLENDED_CODE for X64 with AVX512 - Unix
; Tier1 code
; optimized code
; rbp based frame
; partially interruptible

G_M000_IG01:                ;; offset=0x0000
       push     rbp
       mov      rbp, rsp
 
G_M000_IG02:                ;; offset=0x0004
       mov      rdi, rsi
       mov      r11, 0x79AD0B0605B0
       call     [r11]System.Collections.Generic.ICollection`1[System.__Canon]:get_Count():int:this
       nop      
 
G_M000_IG03:                ;; offset=0x0015
       pop      rbp
       ret      
 
; Total bytes of code 23

PR:

; Assembly listing for method Bench:TestInner(System.Collections.Generic.ICollection`1[System.String]):int:this (Tier1)
; Emitting BLENDED_CODE for X64 with AVX512 - Unix
; Tier1 code
; optimized code
; rbp based frame
; partially interruptible
; 0 inlinees with PGO data; 1 single block inlinees; 0 inlinees without PGO data

G_M000_IG01:                ;; offset=0x0000
       push     rbp
       mov      rbp, rsp
 
G_M000_IG02:                ;; offset=0x0004
       mov      rdi, 0x7714F98998B0
       cmp      qword ptr [rsi], rdi
       jne      SHORT G_M000_IG05
 
G_M000_IG03:                ;; offset=0x0013
       mov      eax, dword ptr [rsi+0x08]
 
G_M000_IG04:                ;; offset=0x0016
       pop      rbp
       ret      
 
G_M000_IG05:                ;; offset=0x0018
       mov      rdi, rsi
       mov      r11, 0x7714F88505B0
; ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (mov: 5) 32B boundary ...............................
       call     [r11]System.Collections.Generic.ICollection`1[System.__Canon]:get_Count():int:this
       jmp      SHORT G_M000_IG04
 
; Total bytes of code 42
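
For readers less used to reading disassembly, the PR codegen in both listings has roughly the shape of the C# below. This is only a source-level approximation, assuming the guessed class is `string[]` (the type the benchmark passes in). `GdvSketch` is a hypothetical name, and the JIT performs this transformation internally rather than on source code.

```csharp
using System.Collections.Generic;

public static class GdvSketch
{
    // Approximate source-level equivalent of the guarded (GDV) codegen for TestInner.
    public static int Count(ICollection<string> c)
    {
        // Guard: exact type test against the guessed class (assumed to be string[]).
        if (c.GetType() == typeof(string[]))
        {
            // Fast path: read the array length directly, no interface call.
            return ((string[])c).Length;
        }

        // Fallback: the original interface call, as in the Main codegen.
        return c.Count;
    }
}
```

Calling `GdvSketch.Count(new string[512])` takes the fast path, while a `List<string>` argument falls through to the interface call.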

AndyAyersMS · 2024-10-26T16:12:46Z

> @AndyAyersMS not sure why perf is the same on arm64, perhaps GDV check is expensive on arm64?

Yeah, seems like it might be the cost of forming the constant for the type.

Also interesting that we can't tail call ... need to investigate that. With the advent of CET/CFG, tail calling is probably becoming more valuable than it used to be (one less return anyways).
