EgorBot for AndyAyersMS in #109209 #137

Open
EgorBot opened this issue Oct 26, 2024 · 7 comments

EgorBot commented Oct 26, 2024

Processing dotnet/runtime#109209 (comment) command:

Command

-intel -arm64 -profiler --envvars DOTNET_JitDisasm:TestInner

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Runtime.CompilerServices;

BenchmarkSwitcher.FromAssembly(typeof(Bench).Assembly).Run(args);

public class Bench
{
    static string[] Data = new string[512];

    [Benchmark]
    public int Test() => TestInner(Data);

    [MethodImpl(MethodImplOptions.NoInlining)]
    int TestInner(ICollection<string> c) => c.Count;
}

(EgorBot will reply in this issue)


EgorBot commented Oct 26, 2024

Benchmark results on Arm64

BenchmarkDotNet v0.14.0, Ubuntu 24.04 LTS (Noble Numbat)
Arm64
  Job-MELZVS : .NET 10.0.0 (42.42.42.42424), Arm64 RyuJIT AdvSIMD
  Job-YBPLAN : .NET 10.0.0 (42.42.42.42424), Arm64 RyuJIT AdvSIMD
EnvironmentVariables=DOTNET_JitDisasm=TestInner
| Method | Toolchain | Mean | Error | Ratio |
|--------|-----------|-----:|------:|------:|
| Test | Main | 1.025 ns | 0.0006 ns | 1.00 |
| Test | PR | 1.082 ns | 0.0007 ns | 1.06 |

BDN_Artifacts.zip

Profile for Bench_Test:

Flame graphs: Main vs PR 🔥
Speedscope: Main vs PR
Hot asm: Main vs PR
Hot functions: Main vs PR
Counters: Main vs PR


EgorBot commented Oct 26, 2024

cc @AndyAyersMS (logs)


EgorBot commented Oct 26, 2024

Benchmark results on Intel

BenchmarkDotNet v0.14.0, Ubuntu 24.04 LTS (Noble Numbat)
Intel Xeon Platinum 8488C, 1 CPU, 8 logical and 4 physical cores
  Job-FXRMOH : .NET 10.0.0 (42.42.42.42424), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
  Job-TQFOQG : .NET 10.0.0 (42.42.42.42424), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
EnvironmentVariables=DOTNET_JitDisasm=TestInner
| Method | Toolchain | Mean | Error | Ratio |
|--------|-----------|-----:|------:|------:|
| Test | Main | 4.0075 ns | 0.0021 ns | 1.00 |
| Test | PR | 0.6158 ns | 0.0013 ns | 0.15 |

BDN_Artifacts.zip

Profile for Bench_Test:

Flame graphs: Main vs PR 🔥
Speedscope: Main vs PR
Hot asm: Main vs PR
Hot functions: Main vs PR
Counters: Main vs PR


EgorBot commented Oct 26, 2024

cc @AndyAyersMS (logs)


EgorBo commented Oct 26, 2024

@AndyAyersMS not sure why perf is the same on arm64; perhaps the GDV check is expensive on arm64? The JitDisasm.asm output (see BDN_Artifacts.zip) does show that the PR has different codegen:

Main:

; Assembly listing for method Bench:TestInner(System.Collections.Generic.ICollection`1[System.String]):int:this (Tier1)
; Emitting BLENDED_CODE for generic ARM64 - Unix
; Tier1 code
; optimized code
; fp based frame
; partially interruptible

G_M000_IG01:                ;; offset=0x0000
            stp     fp, lr, [sp, #-0x10]!
            mov     fp, sp
 
G_M000_IG02:                ;; offset=0x0008
            mov     x0, x1
            movz    x11, #0x5B0
; ............................... 32B boundary ...............................
            movk    x11, #0xB805 LSL #16
            movk    x11, #0xFAC8 LSL #32
            ldr     xip0, [x11]
            blr     xip0
 
G_M000_IG03:                ;; offset=0x0020
            ldp     fp, lr, [sp], #0x10
            ret     lr
 
; Total bytes of code 40

PR:

; Assembly listing for method Bench:TestInner(System.Collections.Generic.ICollection`1[System.String]):int:this (Tier1)
; Emitting BLENDED_CODE for generic ARM64 - Unix
; Tier1 code
; optimized code
; fp based frame
; partially interruptible
; 0 inlinees with PGO data; 1 single block inlinees; 0 inlinees without PGO data

G_M000_IG01:                ;; offset=0x0000
            stp     fp, lr, [sp, #-0x10]!
            mov     fp, sp
 
G_M000_IG02:                ;; offset=0x0008
            ldr     x0, [x1]
            movz    x11, #0x9088
; ............................... 32B boundary ...............................
            movk    x11, #0x1D1C LSL #16
            movk    x11, #0xE000 LSL #32
            cmp     x0, x11
            bne     G_M000_IG05
 
G_M000_IG03:                ;; offset=0x0020
            ldr     w0, [x1, #0x08]
 
G_M000_IG04:                ;; offset=0x0024
            ldp     fp, lr, [sp], #0x10
            ret     lr
 
G_M000_IG05:                ;; offset=0x002C
            mov     x0, x1
; ............................... 32B boundary ...............................
            movz    x11, #0x5B0
            movk    x11, #0x1C02 LSL #16
            movk    x11, #0xE000 LSL #32
            ldr     xip0, [x11]
            blr     xip0
            b       G_M000_IG04
 
; Total bytes of code 72
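
For readers less familiar with GDV (guarded devirtualization), the shape of the PR listing above can be approximated in C#. This is a hand-written sketch, not JIT output: the class and method names (`GdvSketch`, `CountViaInterface`, `CountGuarded`) are hypothetical, and the assumption that the guessed type is `string[]` is inferred from the benchmark passing `Data`, not stated in the listing. The guard corresponds to the `ldr x0, [x1]` / `cmp` / `bne` sequence, the fast path to `ldr w0, [x1, #0x08]` (presumably the array length), and the fallback to the interface call in `G_M000_IG05`.

```csharp
using System;
using System.Collections.Generic;

public static class GdvSketch
{
    // Roughly what Main does: always dispatch through the interface.
    public static int CountViaInterface(ICollection<string> c) => c.Count;

    // Hand-written approximation of the guarded shape in the PR listing.
    // Assumption: the JIT's guess for the concrete type is string[].
    public static int CountGuarded(ICollection<string> c)
    {
        // Guard: an exact type test (the listing compares the object's
        // type handle against a constant).
        if (c.GetType() == typeof(string[]))
        {
            // Fast path: read the array length directly, no call.
            return ((string[])c).Length;
        }

        // Slow path: the guess was wrong, fall back to the interface call.
        return c.Count;
    }
}
```

Both methods return the same value for any non-null input; the guarded version only changes how the count is obtained when the guess is right.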


EgorBo commented Oct 26, 2024

x64:

Main:

; Assembly listing for method Bench:TestInner(System.Collections.Generic.ICollection`1[System.String]):int:this (Tier1)
; Emitting BLENDED_CODE for X64 with AVX512 - Unix
; Tier1 code
; optimized code
; rbp based frame
; partially interruptible

G_M000_IG01:                ;; offset=0x0000
       push     rbp
       mov      rbp, rsp
 
G_M000_IG02:                ;; offset=0x0004
       mov      rdi, rsi
       mov      r11, 0x79AD0B0605B0
       call     [r11]System.Collections.Generic.ICollection`1[System.__Canon]:get_Count():int:this
       nop      
 
G_M000_IG03:                ;; offset=0x0015
       pop      rbp
       ret      
 
; Total bytes of code 23

PR:

; Assembly listing for method Bench:TestInner(System.Collections.Generic.ICollection`1[System.String]):int:this (Tier1)
; Emitting BLENDED_CODE for X64 with AVX512 - Unix
; Tier1 code
; optimized code
; rbp based frame
; partially interruptible
; 0 inlinees with PGO data; 1 single block inlinees; 0 inlinees without PGO data

G_M000_IG01:                ;; offset=0x0000
       push     rbp
       mov      rbp, rsp
 
G_M000_IG02:                ;; offset=0x0004
       mov      rdi, 0x7714F98998B0
       cmp      qword ptr [rsi], rdi
       jne      SHORT G_M000_IG05
 
G_M000_IG03:                ;; offset=0x0013
       mov      eax, dword ptr [rsi+0x08]
 
G_M000_IG04:                ;; offset=0x0016
       pop      rbp
       ret      
 
G_M000_IG05:                ;; offset=0x0018
       mov      rdi, rsi
       mov      r11, 0x7714F88505B0
; ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (mov: 5) 32B boundary ...............................
       call     [r11]System.Collections.Generic.ICollection`1[System.__Canon]:get_Count():int:this
       jmp      SHORT G_M000_IG04
 
; Total bytes of code 42

@AndyAyersMS

> @AndyAyersMS not sure why perf is the same on arm64, perhaps GDV check is expensive on arm64?

Yeah, seems like it might be the cost of forming the constant for the type.

Also interesting that we can't tail call ... need to investigate that. With the advent of CET/CFG, tail calling is probably becoming more valuable than it used to be (one less return, anyway).
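
As background on "forming the constant": in the arm64 PR listing the type constant is built from three 16-bit immediates (`movz` plus two `movk`), whereas the x64 listing uses a single `mov` with a 64-bit immediate. The sketch below only illustrates that arithmetic; it does not measure cost, and the `MovzMovkSketch` class and its methods are hypothetical helpers, not part of any runtime API.

```csharp
using System;

public static class MovzMovkSketch
{
    // Splits a value into 16-bit immediates, lowest chunk first,
    // matching the order movz / movk LSL #16 / movk LSL #32.
    public static ushort[] Chunks(ulong value, int count)
    {
        var chunks = new ushort[count];
        for (int i = 0; i < count; i++)
        {
            chunks[i] = (ushort)((value >> (16 * i)) & 0xFFFF);
        }
        return chunks;
    }

    // Reassembles the value the way the instruction sequence does:
    // each chunk is placed at its 16-bit-aligned position.
    public static ulong Assemble(ushort[] chunks)
    {
        ulong value = 0;
        for (int i = 0; i < chunks.Length; i++)
        {
            value |= (ulong)chunks[i] << (16 * i);
        }
        return value;
    }
}
```

For example, `Assemble(new ushort[] { 0x9088, 0x1D1C, 0xE000 })` combines the three immediates from the PR listing's guard into the value `0xE0001D1C9088`.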
