
PlaintextMVC benchmark is slow on arm64 #60166

Closed
EgorBo opened this issue Oct 8, 2021 · 59 comments
Labels: area-CodeGen-coreclr (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI), tenet-performance (Performance related issue)
Milestone: 7.0.0

Comments

EgorBo (Member) commented Oct 8, 2021

[screenshot]

The Plaintext-PlaintextMVC benchmark should benefit a lot from PGO (namely, from guarded devirtualization and inlining), and it does benefit on all x64 platforms (Linux, Windows, Intel, AMD, etc.): up to +40% more RPS. Unfortunately, that's not the case on arm64, where there is no difference between DynamicPGO and Default. Moreover, the benchmark is 7-8x slower on arm64 than x64-dynamicpgo, while I'd expect it to be only 1.5-2x slower.
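
For context, here is a minimal hand-written sketch of what guarded devirtualization means (illustrative names, not the JIT's actual output): profile data tells the JIT which concrete type dominates an interface call site, so it guards on that exact type and makes the hot path a direct, inlinable call.

using System;

interface IWriter { void Write(string s); }
sealed class ConsoleWriter : IWriter { public void Write(string s) => Console.Write(s); }

static class Demo
{
    // Plain interface call: an indirect call the JIT cannot inline on its own.
    static void WriteSlow(IWriter w, string s) => w.Write(s);

    // Hand-written equivalent of guarded devirtualization: PGO data says the
    // callee is almost always ConsoleWriter, so guard on the exact type and
    // make a direct (inlinable) call on the hot path.
    static void WriteGuarded(IWriter w, string s)
    {
        if (w.GetType() == typeof(ConsoleWriter))
            ((ConsoleWriter)w).Write(s); // direct call, inline candidate
        else
            w.Write(s);                  // cold fallback: indirect call
    }

    static void Main()
    {
        IWriter w = new ConsoleWriter();
        WriteSlow(w, "slow path\n");
        WriteGuarded(w, "guarded path\n");
    }
}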

It looks to me that on arm64 it's bound to JIT_New:
[screenshot]

while on x64 it looks like this:
[screenshot]

Namely, this call site ("drilled" into) (https://github.com/dotnet/aspnetcore/blob/a0b950fc51c43289ab3c6dbea15926d32f3556cc/src/Mvc/Mvc.Core/src/Routing/ControllerRequestDelegateFactory.cs#L68-L101):
Arm64:
[screenshot]

The same call site on x64:
[screenshot]

Flamegraph for arm64 (two JIT_New are highlighted) for the first th:
[screenshot]

x64:
[screenshot]

Does it ring a bell to anyone (e.g. JIT_NewS_MP_FastPortable is not used, some GC feature is not implemented for arm64, some allocation tracker/profiler is enabled, etc.)?
/cc @dotnet/jit-contrib @Maoni0 @jkotas @davidwrighton
I can re-run the benchmark with any modifications in jit/vm/gc you suggest.

Steps to get the native traces:

Arm64-Linux:

crank --config https://raw.githubusercontent.com/aspnet/Benchmarks/main/scenarios/plaintext.benchmarks.yml --scenario mvc --profile aspnet-citrine-arm --application.collect true --application.collectStartup false --application.options.collectCounters false --application.dotnetTrace false --application.framework net6.0 --application.environmentVariables DOTNET_TieredPGO=1 --application.environmentVariables DOTNET_TC_QuickJitForLoops=1 --application.environmentVariables DOTNET_ReadyToRun=0

x64-Linux:

crank --config https://raw.githubusercontent.com/aspnet/Benchmarks/main/scenarios/plaintext.benchmarks.yml --scenario mvc --profile aspnet-citrine-lin --application.collect true --application.collectStartup false --application.options.collectCounters false --application.dotnetTrace false --application.framework net6.0 --application.environmentVariables DOTNET_TieredPGO=1 --application.environmentVariables DOTNET_TC_QuickJitForLoops=1 --application.environmentVariables DOTNET_ReadyToRun=0

Powerbi link: https://aka.ms/aspnet/benchmarks (open "PGO" page there)

EgorBo added the tenet-performance (Performance related issue) label on Oct 8, 2021
The dotnet-issue-labeler bot added the untriaged (New issue has not been triaged by the area owner) label on Oct 8, 2021
dotnet-issue-labeler (bot):

I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.

EgorBo (Member Author) commented Oct 8, 2021

A similar benchmark, Plaintext-Plaintext, looks much better:
[screenshot]

stephentoub (Member) commented Oct 8, 2021

> Does it ring a bell to anyone (e.g. slow jit helper is used, some gc feature is not implemented for arm64, etc)?

Do we know that it is actually slower, as opposed to something else causing that lambda to be invoked many more times?

EgorBo (Member Author) commented Oct 8, 2021

> Does it ring a bell to anyone (e.g. slow jit helper is used, some gc feature is not implemented for arm64, etc)?
>
> Do we know that it is actually slower, as opposed to something else causing that lambda to be invoked many more times?

Not sure, but I'd expect the code paths to be exactly the same. BTW, here is a comparison for all the benchmarks we run PGO for:
[screenshot]
[screenshot]

Exactly the same anomalies show up in JsonMVC, Fortunes, and FortunesEF, where arm64 is 5-7x slower.

EgorBo (Member Author) commented Oct 8, 2021

[screenshot]

^ Most of the benchmarks get a nice boost from DynamicPGO (more precisely, from FullPGO, where we disable R2R and enable QuickJitForLoops) on x64, but I don't see the same boost on arm64 for Fortunes, FortunesEF, JsonMVC and PlaintextMVC - I suspect that's because they are bound on JIT_New.

EgorBo (Member Author) commented Oct 8, 2021

I don't know the exact arm64 hardware we have, but NodeJS benchmarks show around a 50-100% difference between arm64 and x64, so I guess we can expect a similar ratio for .NET.
[screenshot]

JsonMvc and Plaintext benchmarks on NodeJS:
[screenshot]

It'd be nice if we could run the Actix (Rust) benchmarks on arm64 too - they should show a more precise ratio, since Actix is compiled with LLVM, which is heavily optimized for both x64 and arm64.
[screenshot]

Currently, we only run Actix for x64.

cc @sebastienros

sebastienros (Member):

AFAIR the Docker image defined in TE's repo for Actix is not working for ARM64. Will try again.

EgorBo (Member Author) commented Oct 8, 2021

> AFAIR the Docker image defined in TE's repo for Actix is not working for ARM64. Will try again.

It doesn't have to be Actix - anything fast and non-.NET that runs on both x64 and arm64 would do, just to get a feel for the difference.

jeffschwMSFT added the area-CodeGen-coreclr (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI) label on Oct 8, 2021
jkotas (Member) commented Oct 8, 2021

Could you please take a look at the GC info tab in PerfView (it shows the GC stats) and share it here for x64 vs. arm64?

EgorBo (Member Author) commented Oct 8, 2021

@jkotas while I'm re-opening the traces (it takes quite some time on my PC), here are the perf counters (--application.options.collectCounters true):
[screenshot]

sebastienros (Member):

NB: these are markdown formatted and also correctly spaced, so you can either paste them as-is or wrap them in triple backticks.

Stating the obvious: some x64 numbers are simply multiples of the arm64 ones because the RPS is higher (e.g. Allocations, TP), so nothing weird there. But the GC numbers are abnormally high on arm64: 66% CPU, 59 gen 0 collections per second, and the Gen 0 size is significantly higher too.

Also, can you use --chart? That will show whether the max GC (66%) covers just a small period of time or the whole run.

jkotas (Member) commented Oct 8, 2021

I think the GC is clearly part of the problem. Gen0 GCs are firing 4x more frequently per second while throughput is 8x lower, which means the GC rate per unit of work is effectively ~30x higher (4 x 8 = 32). @dotnet/gc should take a look.

jkotas (Member) commented Oct 8, 2021

Notice that the counters show a Gen0 size of 438,360 vs. 672. That is off by a lot...

mangod9 (Member) commented Oct 8, 2021

It would be good to figure out the computed gen0 budget for this arm64 machine.

EgorBo (Member Author) commented Oct 8, 2021

Here is a comparison between the Plaintext-MVC and Plaintext-Plaintext benchmarks on arm64 (with --graph):
[screenshot]

Not sure why the GC stats are empty for the Plaintext-Plaintext one (is it zero-alloc?)

sebastienros (Member):

Plaintext is not allocating and never triggers a GC, so you don't get any stats.

EgorBo (Member Author) commented Oct 8, 2021

That explains why it's so fast then 🙂

EgorBo (Member Author) commented Oct 8, 2021

> It would be good to figure out the computed gen0 budget for this arm64 machine.

Let me know if you want me to check it (though I'd need some pointers on what to look at); I can modify the GC and send it there via crank.

JulieLeeMSFT removed the untriaged (New issue has not been triaged by the area owner) label on Oct 8, 2021
JulieLeeMSFT added this to the 7.0.0 milestone on Oct 8, 2021
Maoni0 (Member) commented Oct 11, 2021

Is it possible to capture a GCCollectOnly trace? It's described here.

sebastienros (Member):

@EgorBo to get the trace requested by @Maoni0, you will need to switch to a managed trace since it's running on Linux.

Use --application.dotnetTrace true --application.dotnetTraceProviders gc-collect instead of --application.collect.

Also, you might need to reduce the length of the benchmark, with --variable warmup=2 --variable duration=5 for instance, just in case the trace is too big.

EgorBo (Member Author) commented Oct 11, 2021

I sent these traces as a zip to CLR GC Core, just in case.

Maoni0 (Member) commented Oct 11, 2021

With the data from @EgorBo and @sebastienros I got to the culprit. On arm64 we are reading the cache size this way:

        if(ReadMemoryValueFromFile("/sys/devices/system/cpu/cpu0/cache/index0/size", &size))
            cacheSize = std::max(cacheSize, size);
        if(ReadMemoryValueFromFile("/sys/devices/system/cpu/cpu0/cache/index1/size", &size))
            cacheSize = std::max(cacheSize, size);
        if(ReadMemoryValueFromFile("/sys/devices/system/cpu/cpu0/cache/index2/size", &size))
            cacheSize = std::max(cacheSize, size);
        if(ReadMemoryValueFromFile("/sys/devices/system/cpu/cpu0/cache/index3/size", &size))
            cacheSize = std::max(cacheSize, size);
        if(ReadMemoryValueFromFile("/sys/devices/system/cpu/cpu0/cache/index4/size", &size))
            cacheSize = std::max(cacheSize, size);

On this particular arm64 machine there is no entry for the L3 cache: it only has index0/index1, which are the L1 data/instruction cache sizes, and index2, which is the L2 cache. Since we take the largest, which is the L2 cache size of 256k, the reported size comes out as 768k, and the gen0 min budget is calculated as 5/8 of that, i.e. 480k, which is of course tiny.

If folks know of a way to get the L3 cache size programmatically on Linux in this case, I'm all ears.
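
For reference, a sketch (in C# just to illustrate; the runtime code above is C++) of enumerating every cache level sysfs actually exposes instead of hard-coding index0..index4. It won't conjure up a missing L3 entry on machines like this one, but it at least returns the largest cache the kernel reports:

using System;
using System.IO;

// Parse sysfs sizes like "32K", "256K", or a bare byte count.
static long ParseSysfsSize(string text)
{
    string s = text.Trim();
    long multiplier = 1;
    if (s.EndsWith("K")) { multiplier = 1024; s = s[..^1]; }
    else if (s.EndsWith("M")) { multiplier = 1024 * 1024; s = s[..^1]; }
    return long.Parse(s) * multiplier;
}

long largest = 0;
foreach (string dir in Directory.GetDirectories("/sys/devices/system/cpu/cpu0/cache", "index*"))
{
    string sizeFile = Path.Combine(dir, "size");
    if (File.Exists(sizeFile))
        largest = Math.Max(largest, ParseSysfsSize(File.ReadAllText(sizeFile)));
}
Console.WriteLine($"Largest cache reported by sysfs: {largest} bytes");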

EgorBo (Member Author) commented Oct 11, 2021

Duplicating here:
When I hard-coded GetLogicalProcessorCacheSizeFromOS to always return 32*1024*1024 and copied my locally built clrgc/coreclr there via crank, I got 4x better RPS for this specific benchmark.

[screenshot] (32*1024*1024 is on the right)

BruceForstall (Member):

> If folks know of a way to get the L3 cache size programmatically on Linux in this case, I'm all ears.

@janvorli ?

EgorBo (Member Author) commented Oct 13, 2021

Full report (default vs COMPlus_GCgen0size=1E00000, i.e. a gen0 size of 0x1E00000 = 30 MB):
[screenshot]

It compares RPS and P90 latency (ms) for both the Default JIT mode and "Full PGO".

sebastienros (Member):

@EgorBo based on how many benchmarks you must have run, I believe I should ask for some extra fans to be installed in the machine.

kunalspathak (Member):

I did a run today for the JsonMVC benchmark. For arm64, I collected with DOTNET_GCgen0size=1E00000. Below is the comparison of x64 vs. arm64:

[screenshot]

The max working set, GC heap size, and size of committed memory by the GC are almost 2.5x higher on arm64. At the same time, the Gen2 and LOH sizes are almost 3x smaller on arm64.

If I'm reading the gc-collect trace correctly, on arm64 we did just 2 rounds of GC vs. 17 rounds on x64.
Here is the arm64 GC info:

[screenshot]

Here is the x64 GC info:

[screenshot]

I also noticed the thread pool queue length and items/second are 2x smaller on arm64.

@mangod9 or @Maoni0 - any idea why and what should be the next steps to investigate?

mangod9 (Member) commented Dec 17, 2021

Are there differences in available memory on the two machines? We will need to compare profiles across the two to investigate further.

kunalspathak (Member):

> Are there differences in available memory on the two machines? We will need to compare profiles across the two to investigate further.

@sebastienros - could you please confirm?

> I also noticed the thread pool queue length and items/second are 2x smaller on arm64.

x64 has 28 logical cores vs. 32 cores on arm64. With that in mind, what can we conclude about the thread pool queue length and items/second differences?

sebastienros (Member):

ARM:

ThinkSystem HR330A
1x 32-Core/3.0GHz eMAG CPU
64GB DDR4 (8x8GB)

INTEL:

Intel(R) Xeon(R) Gold 5120 CPU @ 2.20GHz, 2195 Mhz
14 Core(s), 28 Logical Processor(s)
32GB

Also a reminder that this ARM machine doesn't report L3 the way dotnet is reading it.

mangod9 (Member) commented Dec 17, 2021

The higher memory on the ARM machine probably explains the low GC count. I assume the container running the test can be restricted to 32 GB for a better comparison?

> Also a reminder that this ARM machine doesn't report L3 the way dotnet is reading it.

Yeah, we are aware of that issue; hence Kunal is testing with DOTNET_GCgen0size=1E00000.

kunalspathak (Member):

> The higher memory on the ARM machine probably explains the low GC count. I assume the container running the test can be restricted to 32 GB for a better comparison?

Yes, that will give us a better apples-to-apples comparison.

kunalspathak (Member) commented Dec 17, 2021

@sebastienros suggested passing arguments to limit cpu/memory, so I used --application.memoryLimitInBytes 17179869184 --application.cpuSet 0-7 (~16 GB of memory and 8 cores) for both x64 and arm64, but that doesn't change things. I still see the mismatch in GC/threadpool.

[screenshot]

jkotas (Member) commented Dec 17, 2021

You may want to run some basic GC and threadpool perf tests to isolate the problem; there may be more than one. For example, what is the allocation rate (B/sec) for a test like this:

using System.Threading;

int cpus = Environment.ProcessorCount;
for (int i = 0; i < cpus; i++)
    new Thread(() => { for (;;) GC.KeepAlive(new object()); }).Start(); // keep every core allocating

Thread.Sleep(20_000);
Environment.Exit(0);
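
A hedged variation of that test that actually prints a number, sampling GC.GetTotalAllocatedBytes (available since .NET Core 3.0) around the sleep:

using System;
using System.Diagnostics;
using System.Threading;

int cpus = Environment.ProcessorCount;
for (int i = 0; i < cpus; i++)
    new Thread(() => { for (;;) GC.KeepAlive(new object()); }) { IsBackground = true }.Start();

long before = GC.GetTotalAllocatedBytes(precise: false);
var sw = Stopwatch.StartNew();
Thread.Sleep(20_000);
long allocated = GC.GetTotalAllocatedBytes(precise: false) - before;
Console.WriteLine($"{allocated / sw.Elapsed.TotalSeconds:N0} B/sec"); // compare this figure across the two machines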

kunalspathak (Member):

If I'm reading the traces correctly, it seems that the GC pause time on arm64 is 5x that of x64.

Both scenarios were run using the flags
--application.dotnetTrace true --application.dotnetTraceProviders gc-collect --application.memoryLimitInBytes 17179869184 --application.cpuSet 0-7.

x64:

[screenshot]

arm64 (in addition to the above flags, also added --application.environmentVariables DOTNET_GCgen0size=1E00000):

[screenshot]

@Maoni0 - can you suggest any further steps to track this down?

Maoni0 (Member) commented Jan 3, 2022

Does it also collect more? What are the survived bytes for these GCs? You could also share the traces.

kunalspathak (Member):

> Does it also collect more? What are the survived bytes for these GCs? You could also share the traces.

I will email them to you.

kunalspathak (Member):

Looking at the traces with @Maoni0, there are some unusual call stacks in the StackWalker code, which does more work on arm64 compared to x64:

arm64:

[screenshot]

x64:

[screenshot]

@mangod9 - do you know what would make arm64 do more work than x64 in the stack walking code?

mangod9 (Member) commented Jan 3, 2022

Adding @AntonLapounov and @janvorli in case they have seen this before. Could you please share the traces with us?

kunalspathak (Member):

> Could you please share the traces with us?

Done over email.

kunalspathak (Member):

Update:

On @Maoni0's suggestion, we came up with a concentrated repro that doesn't need the thread pool or Thread.Sleep. With it, we can see the slow allocation rate and the regression in max pause time.

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;
using System.Text;
using System.Threading.Tasks;

namespace ConsoleApp15
{
    public class GC_Test
    {
        static int depth = 25;
        public class Node
        {
            public Node left;
            public Node right;
        }
        // Recursively build a full binary tree of the given depth.
        static void Populate(int iDepth, Node thisNode)
        {
            if (iDepth <= 0)
                return;
            else
            {
                iDepth--;
                thisNode.left = new Node();
                thisNode.right = new Node();
                Populate(iDepth, thisNode.left);
                Populate(iDepth, thisNode.right);
            }
        }

        public static void Main(string[] args)
        {
            Console.WriteLine("Started");
            var stopwatch = Stopwatch.StartNew();
            Node root = new Node();
            Populate(depth, root);
            // Induce 100 blocking collections over the long-lived tree.
            for (int i = 0; i < 100; i++)
            {
                GC.Collect();
            }
            Console.WriteLine($"Finished in {stopwatch.ElapsedMilliseconds}ms");
        }
    }
}
| application | x64_gc_collectCounters | arm64_gc_collectCounters | Diff | Comments |
| --- | --- | --- | --- | --- |
| Execution Time | 10334 msec | 28873 msec | +178% | <-- |
| CPU Usage (%) | 4 | 4 | 0.00% | |
| Cores usage (%) | 112 | 143 | +27.68% | |
| Working Set (MB) | 449 | 558 | +24.28% | |
| Private Memory (MB) | 1,318 | 1,413 | +7.21% | |
| Build Time (ms) | 1,913 | 5,885 | +207.63% | |
| Start Time (ms) | 44 | 98 | +122.73% | |
| Published Size (KB) | 70,441 | 77,501 | +10.02% | |
| .NET Core SDK Version | 6.0.101 | 6.0.101 | | |
| Max CPU Usage (%) | 14 | 17 | +21.43% | |
| Max Working Set (MB) | 520 | 585 | +12.50% | |
| Max GC Heap Size (MB) | 453 | 501 | +10.60% | |
| Size of committed memory by the GC (MB) | 454 | 503 | +10.79% | |
| Max Number of Gen 0 GCs / sec | 26.00 | 5.00 | -80.77% | |
| Max Number of Gen 1 GCs / sec | 14.00 | 3.00 | -78.57% | |
| Max Number of Gen 2 GCs / sec | 2.00 | 1.00 | -50.00% | |
| Max Time in GC (%) | 86.00 | 99.00 | +15.12% | |
| Max Gen 0 Size (B) | 24 | 24 | 0.00% | |
| Max Gen 1 Size (B) | 20,185,184 | 62,914,656 | +211.69% | |
| Max Gen 2 Size (B) | 433,981,480 | 471,853,312 | +8.73% | |
| Max LOH Size (B) | 24 | 24 | 0.00% | |
| Max Allocation Rate (B/sec) | 245,607,168 | 115,520,480 | -52.97% | <-- |
| Max GC Heap Fragmentation | 0 | 0 | +25.45% | |
| # of Assemblies Loaded | 6 | 6 | 0.00% | |
| Max Exceptions (#/s) | 0 | 0 | | |
| Max Lock Contention (#/s) | 0 | 0 | | |
| Max ThreadPool Threads Count | 0 | 0 | | |
| Max ThreadPool Queue Length | 0 | 0 | | |
| Max ThreadPool Items (#/s) | 0 | 0 | | |
| Max Active Timers | 0 | 0 | | |
| IL Jitted (B) | 11,984 | 21,655 | +80.70% | <-- why different? |
| Methods Jitted | 69 | 155 | +124.64% | <-- why different? |

Below are screenshots of the GCStats and call stacks for both x64 and arm64. Some observations:

  • The execution time and GC allocation rate are 2x worse on arm64.
  • x64 did 34 more GCs than arm64, in addition to the 100 GCs induced via GC.Collect(). I assume this is due to the overall speed difference: on x64 the code executes faster, allocates more objects per unit of time, and hence triggers GCs more often; on arm64 the slowness leads to a lower allocation rate and fewer GCs.
  • As with the previous data, even in this short repro the "Max Pause Time" is 4x higher on arm64 than on x64.
  • Looking at the call stack screenshots, plan_phase and mark_phase are the biggest contributors to the diff. In plan_phase, even the simple introsort_loop takes longer to execute. My initial assumption was a difference in the code produced by LLVM (which we use to build Linux x64/arm64). I extracted introsort_loop (and the surrounding code) and checked the diff in godbolt, but nothing stands out.
  • In mark_phase, I can see a clear difference in mark_object_simple1, but I don't see why that would be the case. Maybe there is a huge difference in the amount of data the two platforms operate on?
  • Lastly, as seen in the table above, the number of methods/IL bytes jitted is different. I will collect more traces to see why that is.

x64:

[screenshot: GCStats]

[screenshot: call stacks]

arm64:

[screenshot: GCStats]

[screenshot: call stacks]

jkotas (Member) commented Jan 15, 2022

> The execution time and GC allocation rate are 2x worse on arm64.

One of the factors that influences the GC allocation rate is memory bandwidth. It may be a good idea to measure the memory bandwidth on the two machines: write a simple micro-benchmark that clears a 100 MB buffer (no GC involved).
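
A minimal sketch of such a micro-benchmark, assuming Array.Clear as the write primitive and a single thread (run it on both machines and compare MB/s):

using System;
using System.Diagnostics;

byte[] buffer = new byte[100 * 1024 * 1024]; // 100 MB, allocated once; no GC during the loop
const int iterations = 50;

var sw = Stopwatch.StartNew();
for (int i = 0; i < iterations; i++)
    Array.Clear(buffer, 0, buffer.Length); // sequential writes over the whole buffer
sw.Stop();

double mbWritten = iterations * (buffer.Length / (1024.0 * 1024.0));
Console.WriteLine($"~{mbWritten / sw.Elapsed.TotalSeconds:F0} MB/s write bandwidth");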

> x64 did 34 more GCs than arm64, in addition to the 100 GCs induced via GC.Collect(). I assume this is due to the overall speed difference: on x64 the code executes faster, allocates more objects per unit of time, and hence triggers GCs more often.

I do not think that this explanation makes sense. The benchmark you shared allocates the same amount on both arm64 and x64. I think the different number of GCs is caused by the GC setting the allocation budgets differently. The allocation budgets are computed from the cache size, so the difference is most likely caused by different (or incorrectly computed) cache sizes between arm64 and x64. There was a discussion about the buggy cache size computation earlier in this thread. Are you using any env. variables to compensate for it?

kunalspathak (Member):

> There was a discussion about the buggy cache size computation earlier in this thread. Are you using any env. variables to compensate for it?

Yes, I am using the environment variable on arm64 to set the gen0 size, and I am setting the same memory and CPU limits for both x64 and arm64:

--application.environmentVariables DOTNET_GCgen0size=1E00000 --application.memoryLimitInBytes 17179869184 --application.cpuSet 0-7

AntonLapounov (Member):

@kunalspathak How is it possible that Max Gen 0 Size is only 24 bytes for your small repro and only 192 bytes for the benchmark on ARM64?

kunalspathak (Member):

> @kunalspathak How is it possible that Max Gen 0 Size is only 24 bytes for your small repro and only 192 bytes for the benchmark on ARM64?

Honestly, I am not sure how those numbers are collected and reported. Currently I am working with @Maoni0 to gather traces for https://github.com/dotnet/performance/tree/main/src/benchmarks/gc/src/exec/GCPerfSim.

Maoni0 (Member) commented Feb 2, 2022

@AntonLapounov that's just the generation size after a GC; if there's no pinning, it could easily be 24 bytes (which is just the min object size).

EgorBo (Member Author) commented Jul 6, 2022

Closing as completed

EgorBo closed this as completed on Jul 6, 2022
ghost locked this as resolved and limited the conversation to collaborators on Aug 6, 2022