Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Regressions in System.Collections.Tests.Perf_BitArray and System.Collections.IndexerSet<Int32> #66769

Closed
performanceautofiler bot opened this issue Mar 17, 2022 · 9 comments
Assignees
Labels
arch-arm64 area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI runtime-coreclr specific to the CoreCLR runtime
Milestone

Comments

@performanceautofiler
Copy link

performanceautofiler bot commented Mar 17, 2022

Run Information

Architecture arm64
OS Windows 10.0.19041
Baseline 63fd977f7d57e0c0e434caecfff0dd0a07d2f27f
Compare 466b0f66aafe63a31696c16dca43a727cd4bbce5
Diff Diff

Regressions in System.Collections.Tests.Perf_BitArray

Benchmark Baseline Test Test/Base Test Quality Edge Detector Baseline IR Compare IR IR Ratio Baseline ETL Compare ETL
BitArraySetLengthShrink - Duration of single invocation 237.51 ns 257.46 ns 1.08 0.00 False

graph
Test Report

Repro

git clone https://github.com/dotnet/performance.git
py .\performance\scripts\benchmarks_ci.py -f net6.0 --filter 'System.Collections.Tests.Perf_BitArray*'

Payloads

Baseline
Compare

Histogram

System.Collections.Tests.Perf_BitArray.BitArraySetLengthShrink(Size: 512)


Description of detection logic

IsRegressionChecked: Marked as regression because the three check build points were 0.05 greater than the baseline.
IsImprovementBase: Marked as not an improvement because the compare was not 5% less than the baseline, or the value was too small.
IsRegressionBase: Marked as regression because the compare was 5% greater than the baseline, and the value was not too small.
IsRegressionChecked: Marked as regression because the three check build points were 0.05 greater than the baseline.
IsRegressionWindowed: Marked as regression because 257.45578083352854 > 250.07199194800694.
IsChangePoint: Marked as a change because one of 3/15/2022 12:53:39 AM, 3/17/2022 12:08:34 AM falls between 3/8/2022 5:51:58 AM and 3/17/2022 12:08:34 AM.
IsRegressionStdDev: Marked as regression because -11.78109322334737 (T) = (0 -258.6891193534734) / Math.Sqrt((8.595811911146967 / (36)) + (13.040673937399216 / (6))) is less than -2.0210753903043583 = MathNet.Numerics.Distributions.StudentT.InvCDF(0, 1, (36) + (6) - 2, .025) and -0.07611578445829842 = (240.391529507853 - 258.6891193534734) / 240.391529507853 is less than -0.05.
IsImprovementBase: Marked as not an improvement because the compare was not 5% less than the baseline, or the value was too small.
IsChangeEdgeDetector: Marked not as a regression because Edge Detector said so.

Docs

Profiling workflow for dotnet/runtime repository
Benchmarking workflow for dotnet/runtime repository

### Run Information
Architecture arm64
OS Windows 10.0.19041
Baseline 63fd977f7d57e0c0e434caecfff0dd0a07d2f27f
Compare 466b0f66aafe63a31696c16dca43a727cd4bbce5
Diff Diff

Regressions in System.Collections.IndexerSet<Int32>

Benchmark Baseline Test Test/Base Test Quality Edge Detector Baseline IR Compare IR IR Ratio Baseline ETL Compare ETL
Span - Duration of single invocation 177.00 ns 187.44 ns 1.06 0.07 False

graph
Test Report

Repro

git clone https://github.com/dotnet/performance.git
py .\performance\scripts\benchmarks_ci.py -f net6.0 --filter 'System.Collections.IndexerSet&lt;Int32&gt;*'

Payloads

Baseline
Compare

Histogram

System.Collections.IndexerSet<Int32>.Span(Size: 512)


Description of detection logic

IsRegressionChecked: Marked as regression because the three check build points were 0.05 greater than the baseline.
IsImprovementBase: Marked as not an improvement because the compare was not 5% less than the baseline, or the value was too small.
IsRegressionBase: Marked as regression because the compare was 5% greater than the baseline, and the value was not too small.
IsRegressionChecked: Marked as regression because the three check build points were 0.05 greater than the baseline.
IsRegressionWindowed: Marked as regression because 187.44322490494397 > 180.31185587212346.
IsChangePoint: Marked as a change because one of 3/15/2022 12:53:39 AM, 3/17/2022 12:08:34 AM falls between 3/8/2022 5:51:58 AM and 3/17/2022 12:08:34 AM.
IsRegressionStdDev: Marked as regression because -17.407749272978624 (T) = (0 -188.7349199381894) / Math.Sqrt((6.275661946704857 / (36)) + (4.068162474412199 / (6))) is less than -2.0210753903043583 = MathNet.Numerics.Distributions.StudentT.InvCDF(0, 1, (36) + (6) - 2, .025) and -0.0930788703937596 = (172.66358819121757 - 188.7349199381894) / 172.66358819121757 is less than -0.05.
IsImprovementBase: Marked as not an improvement because the compare was not 5% less than the baseline, or the value was too small.
IsChangeEdgeDetector: Marked not as a regression because Edge Detector said so.

Docs

Profiling workflow for dotnet/runtime repository
Benchmarking workflow for dotnet/runtime repository

category:performance
theme:benchmarks

@performanceautofiler performanceautofiler bot added arm64 untriaged New issue has not been triaged by the area owner labels Mar 17, 2022
@EgorBo EgorBo changed the title [Perf] Changes at 3/15/2022 4:36:24 AM Regressions in System.Collections.Tests.Perf_BitArray and System.Collections.IndexerSet<Int32> Mar 17, 2022
@EgorBo EgorBo transferred this issue from dotnet/perf-autofiling-issues Mar 17, 2022
@EgorBo
Copy link
Member

EgorBo commented Mar 17, 2022

Most likely #66618 cc @AndyAyersMS

@jeffschwMSFT jeffschwMSFT added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Mar 18, 2022
@ghost
Copy link

ghost commented Mar 18, 2022

Tagging subscribers to this area: @JulieLeeMSFT
See info in area-owners.md if you want to be subscribed.

Issue Details

Run Information

Architecture arm64
OS Windows 10.0.19041
Baseline 63fd977f7d57e0c0e434caecfff0dd0a07d2f27f
Compare 466b0f66aafe63a31696c16dca43a727cd4bbce5
Diff Diff

Regressions in System.Collections.Tests.Perf_BitArray

Benchmark Baseline Test Test/Base Test Quality Edge Detector Baseline IR Compare IR IR Ratio Baseline ETL Compare ETL
BitArraySetLengthShrink - Duration of single invocation 237.51 ns 257.46 ns 1.08 0.00 False

graph
Test Report

Repro

git clone https://github.com/dotnet/performance.git
py .\performance\scripts\benchmarks_ci.py -f net6.0 --filter 'System.Collections.Tests.Perf_BitArray*'

Payloads

Baseline
Compare

Histogram

System.Collections.Tests.Perf_BitArray.BitArraySetLengthShrink(Size: 512)


Description of detection logic

IsRegressionChecked: Marked as regression because the three check build points were 0.05 greater than the baseline.
IsImprovementBase: Marked as not an improvement because the compare was not 5% less than the baseline, or the value was too small.
IsRegressionBase: Marked as regression because the compare was 5% greater than the baseline, and the value was not too small.
IsRegressionChecked: Marked as regression because the three check build points were 0.05 greater than the baseline.
IsRegressionWindowed: Marked as regression because 257.45578083352854 > 250.07199194800694.
IsChangePoint: Marked as a change because one of 3/15/2022 12:53:39 AM, 3/17/2022 12:08:34 AM falls between 3/8/2022 5:51:58 AM and 3/17/2022 12:08:34 AM.
IsRegressionStdDev: Marked as regression because -11.78109322334737 (T) = (0 -258.6891193534734) / Math.Sqrt((8.595811911146967 / (36)) + (13.040673937399216 / (6))) is less than -2.0210753903043583 = MathNet.Numerics.Distributions.StudentT.InvCDF(0, 1, (36) + (6) - 2, .025) and -0.07611578445829842 = (240.391529507853 - 258.6891193534734) / 240.391529507853 is less than -0.05.
IsImprovementBase: Marked as not an improvement because the compare was not 5% less than the baseline, or the value was too small.
IsChangeEdgeDetector: Marked not as a regression because Edge Detector said so.

Docs

Profiling workflow for dotnet/runtime repository
Benchmarking workflow for dotnet/runtime repository

### Run Information
Architecture arm64
OS Windows 10.0.19041
Baseline 63fd977f7d57e0c0e434caecfff0dd0a07d2f27f
Compare 466b0f66aafe63a31696c16dca43a727cd4bbce5
Diff Diff

Regressions in System.Collections.IndexerSet<Int32>

Benchmark Baseline Test Test/Base Test Quality Edge Detector Baseline IR Compare IR IR Ratio Baseline ETL Compare ETL
Span - Duration of single invocation 177.00 ns 187.44 ns 1.06 0.07 False

graph
Test Report

Repro

git clone https://github.com/dotnet/performance.git
py .\performance\scripts\benchmarks_ci.py -f net6.0 --filter 'System.Collections.IndexerSet&lt;Int32&gt;*'

Payloads

Baseline
Compare

Histogram

System.Collections.IndexerSet<Int32>.Span(Size: 512)


Description of detection logic

IsRegressionChecked: Marked as regression because the three check build points were 0.05 greater than the baseline.
IsImprovementBase: Marked as not an improvement because the compare was not 5% less than the baseline, or the value was too small.
IsRegressionBase: Marked as regression because the compare was 5% greater than the baseline, and the value was not too small.
IsRegressionChecked: Marked as regression because the three check build points were 0.05 greater than the baseline.
IsRegressionWindowed: Marked as regression because 187.44322490494397 > 180.31185587212346.
IsChangePoint: Marked as a change because one of 3/15/2022 12:53:39 AM, 3/17/2022 12:08:34 AM falls between 3/8/2022 5:51:58 AM and 3/17/2022 12:08:34 AM.
IsRegressionStdDev: Marked as regression because -17.407749272978624 (T) = (0 -188.7349199381894) / Math.Sqrt((6.275661946704857 / (36)) + (4.068162474412199 / (6))) is less than -2.0210753903043583 = MathNet.Numerics.Distributions.StudentT.InvCDF(0, 1, (36) + (6) - 2, .025) and -0.0930788703937596 = (172.66358819121757 - 188.7349199381894) / 172.66358819121757 is less than -0.05.
IsImprovementBase: Marked as not an improvement because the compare was not 5% less than the baseline, or the value was too small.
IsChangeEdgeDetector: Marked not as a regression because Edge Detector said so.

Docs

Profiling workflow for dotnet/runtime repository
Benchmarking workflow for dotnet/runtime repository

Author: performanceautofiler[bot]
Assignees: -
Labels:

area-CodeGen-coreclr, untriaged, refs/heads/main, RunKind=micro, Windows 10.0.19041, Regression, CoreClr, arm64

Milestone: -

@JulieLeeMSFT JulieLeeMSFT removed the untriaged New issue has not been triaged by the area owner label Mar 21, 2022
@JulieLeeMSFT JulieLeeMSFT added this to the 7.0.0 milestone Mar 21, 2022
@am11 am11 added arch-arm64 and removed arm64 labels Mar 30, 2022
@AndyAyersMS
Copy link
Member

Both of these have persisted since
newplot - 2022-05-02T111634 558
newplot - 2022-05-02T111637 556

@AndyAyersMS
Copy link
Member

AndyAyersMS commented Jun 29, 2022

Local repros (base is 63fd977, diff is cd88b84)

Method Job Toolchain Size Mean Error StdDev Median Min Max Ratio RatioSD Gen 0 Allocated Alloc Ratio
BitArraySetLengthShrink Job-PUHVHZ \base-rel\corerun.exe 512 138.76 ns 2.033 ns 1.901 ns 138.91 ns 135.67 ns 141.87 ns 1.00 0.00 0.0904 568 B 1.00
BitArraySetLengthShrink Job-JSGEQE \diff-rel\corerun.exe 512 217.63 ns 2.341 ns 2.076 ns 216.98 ns 214.36 ns 222.05 ns 1.57 0.03 0.0902 568 B 1.00
Method Job Toolchain Size Mean Error StdDev Median Min Max Ratio RatioSD Allocated Alloc Ratio
Span Job-JEUKJG \base-rel\corerun.exe 512 148.4 ns 1.23 ns 1.09 ns 148.3 ns 146.1 ns 150.1 ns 1.00 0.00 - NA
Span Job-EXQWVI \diff-rel\corerun.exe 512 149.2 ns 2.84 ns 2.66 ns 148.9 ns 146.1 ns 154.8 ns 1.00 0.02 - NA

So Span seems ok locally.

[Edit: actually the original report was vs arm64; the data just above & assembly below are for x64. Not sure it matters, but I'll take a look at arm64 too].

@AndyAyersMS
Copy link
Member

AndyAyersMS commented Jun 29, 2022

All the time in BitArraySetLengthShrink is in BitArray..ctor. This is dominated by its loop. There are two codegen diffs in the loop:

  • we no longer try and align the loop top (and so the jae ends up crossing a 32 byte boundary)
  • we miss a CSE

Both of these happen because there is some PGO data for bits and pieces of the loop body, and so we don't reweight the blocks. This leads to a failed weight check for alignment and a lower-weight CSE candidate.

G_M47086_IG03:  ;; offset=0070H                             G_M47086_IG03: ;; offset=0066H

4D8BCA          mov      r9, r10                            4D8BC1         mov      r8, r9
4C8BC7          mov      r8, rdi                            488BC7         mov      rax, rdi
83FB04          cmp      ebx, 4                             83FB04         cmp      ebx, 4
0F8C4A010000    jl       G_M47086_IG12                      0F8C4B010000   jl       G_M47086_IG12

458B00          mov      r8d, dword ptr [r8]                8B00           mov      eax, dword ptr [rax]
413B5108        cmp      edx, dword ptr [r9+8]              413B5008       cmp      edx, dword ptr [r8+8]
0F834E010000    jae      G_M47086_IG14                      0F8352010000   jae      G_M47086_IG14          ;; JCC

448BDA          mov      r11d, edx                          448BD2         mov      r10d, edx
4789449910      mov      dword ptr [r9+4*r11+16], r8d       4389449010     mov      dword ptr [r8+4*r10+16], eax
83FB04          cmp      ebx, 4                             83FB04         cmp      ebx, 4
0F8237010000    jb       G_M47086_IG13                      0F823A010000   jb       G_M47086_IG13

83C3FC          add      ebx, -4                            4883C704       add      rdi, 4
4883C704        add      rdi, 4                             83C3FC         add      ebx, -4
FFC2            inc      edx                                FFC2           inc      edx
4C63C2          movsxd   r8, edx                            4863C2         movsxd   rax, edx
                                                            448BC1         mov      r8d, ecx              ;; CSE
4C3BC0          cmp      r8, rax                            493BC0         cmp      rax, r8
7CC2            jl       SHORT G_M47086_IG03                7CC0           jl       SHORT G_M47086_IG03

Also interesting (but common to both versions) is that we don't hoist the array length fetch.

@AndyAyersMS
Copy link
Member

This is really an instance of a more general problem with the way the jit handles methods with partial PGO data. I don't think it is possible to fix this without some major revisions to this whole area, and that seems out of scope for .NET 7.

In particular:

  • when we combine two blocks if either is profiled we use the max weight for the combined block. In a partially profiled method this causes the effective "range" of PGO to grow. In particular if we have a non-pgo loop with a bit of pgo inside, the pgo range can grow to encompass the loop and prevent loop scale-up (more or less what happens in the above).
  • when we do general scale-down we scale down blocks with PGO
  • when we do loop scale-up we won't scale up blocks with PGO

All this needs to be rethought; in a mixed PGO method the incoming PGO data should be used to provide exit probabilities and we need a general algorithm to synthesize the same for non-PGO blocks and then from there deduce consistent block and edge counts.

Related work:

Going to move this to future.

@AndyAyersMS AndyAyersMS modified the milestones: 7.0.0, Future Jun 29, 2022
@AndyAyersMS
Copy link
Member

AndyAyersMS commented Jun 29, 2022

Local arm64 data for same pair of commits as above. Not showing any serious regression.

Method Job Toolchain Size Mean Error StdDev Median Min Max Ratio Gen 0 Allocated Alloc Ratio
BitArraySetLengthShrink Job-CCSRPV \base-rel\corerun.exe 512 135.7 ns 0.94 ns 0.88 ns 135.4 ns 134.4 ns 137.4 ns 1.00 0.1355 568 B 1.00
BitArraySetLengthShrink Job-UOKGVV \diff-rel\corerun.exe 512 139.5 ns 0.54 ns 0.48 ns 139.5 ns 138.7 ns 140.2 ns 1.03 0.1353 568 B 1.00

Disassembly shows similar artifacts as x64 -- no alignment, extra mov

;; BASE

G_M47086_IG03:              ;; offset=0080H
        AA0603E5          mov     x5, x6
        AA0103E4          mov     x4, x1
        710012BF          cmp     w21, #4
        54000B6B          blt     G_M47086_IG13
        B9400084          ldr     w4, [x4]
        B94008A7          ldr     w7, [x5,#8]
        6B07005F          cmp     w2, w7
        54000B82          bhs     G_M47086_IG15
; ............................... 32B boundary ...............................
        910040A5          add     x5, x5, #16
        B82258A4          str     w4, [x5, w2, UXTW #2]
        710012BF          cmp     w21, #4
        54000AC3          blo     G_M47086_IG14
        510012B5          sub     w21, w21, #4
        91001021          add     x1, x1, #4
        11000442          add     w2, w2, #1
        93407C44          sxtw    x4, w2
; ............................... 32B boundary ...............................
        EB03009F          cmp     x4, x3
        54FFFDEB          blt     G_M47086_IG03

;; DIFF
G_M47086_IG03:              ;; offset=0078H
        AA0503E4          mov     x4, x5
        AA0103E3          mov     x3, x1
; ............................... 32B boundary ...............................
        710012BF          cmp     w21, #4
        54000D6B          blt     G_M47086_IG12
        B9400063          ldr     w3, [x3]
        B9400886          ldr     w6, [x4,#8]
        6B06005F          cmp     w2, w6
        54000E82          bhs     G_M47086_IG14
        91004084          add     x4, x4, #16
        B8225883          str     w3, [x4, w2, UXTW #2]
; ............................... 32B boundary ...............................
        710012BF          cmp     w21, #4
        54000D43          blo     G_M47086_IG13
        91001021          add     x1, x1, #4
        510012B5          sub     w21, w21, #4
        11000442          add     w2, w2, #1
        93407C43          sxtw    x3, w2
        2A0003E4          mov     w4, w0    ;; CSE
        EB04007F          cmp     x3, x4
; ............................... 32B boundary ...............................
        54FFFDCB          blt     G_M47086_IG03                               

@JulieLeeMSFT JulieLeeMSFT modified the milestones: Future, 8.0.0 Oct 5, 2022
@AndyAyersMS
Copy link
Member

BitArray seems to have recovered since the above

newplot - 2022-11-01T080018 763

but not IndexerSet

newplot - 2022-11-01T080145 601

@jeffhandley jeffhandley added os-windows runtime-coreclr specific to the CoreCLR runtime and removed Windows 10.0.19041 labels Dec 28, 2022
@AndyAyersMS
Copy link
Member

Both have since improved and are as fast or faster than they were when this report was filed.

BitArray possibly from #81095, IndexerSet from #85275.

newplot - 2023-05-15T200013 433

newplot - 2023-05-15T195850 222

@ghost ghost locked as resolved and limited conversation to collaborators Jun 15, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
arch-arm64 area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI runtime-coreclr specific to the CoreCLR runtime
Projects
None yet
Development

No branches or pull requests

7 participants