
[mono][aot] Investigate RAM consumption in Mono AOT compiler #95791

Closed
1 of 2 tasks
Tracked by #90427
kotlarmilos opened this issue Dec 8, 2023 · 8 comments
@kotlarmilos
Member

kotlarmilos commented Dec 8, 2023

Motivation and Background

The Mono AOT compiler currently requires a minimum of 16 GB of RAM to compile System.Private.CoreLib.dll on a Linux machine. This limitation is preventing us from running full AOT tests in our CI. The purpose of this issue is to explore ways to reduce the AOT compiler's RAM consumption.

Analysis

We conducted a simple experiment with the free -m command inside a Docker container using the cbl-mariner-2.0-cross-amd64 image on a machine with 32 GB of RAM. The graph below presents RAM consumption during consecutive AOT compilations of the runningmono.dll and System.Private.CoreLib.dll assemblies using the LLVM configuration in release mode.
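An experiment like this can be reproduced with a small sampling helper. The sketch below separates parsing from sampling so the parsing is portable; the column layout of `free -m` output is an assumption about the standard procps format, and the sampler itself only works on Linux with `free` installed.

```python
import subprocess

# Canned `free -m` output used to illustrate the expected column layout.
FREE_SAMPLE = """\
              total        used        free      shared  buff/cache   available
Mem:          31996       14210        1023         512       16762       16900
Swap:          2048           0        2048
"""

def parse_used_mb(free_output):
    """Extract the 'used' column (MB) of the Mem: row from `free -m` output."""
    for line in free_output.splitlines():
        if line.startswith("Mem:"):
            return int(line.split()[2])
    raise ValueError("no Mem: row found")

def sample_used_mb():
    """Sample current used memory in MB (requires Linux and procps `free`)."""
    return parse_used_mb(subprocess.check_output(["free", "-m"], text=True))
```

Calling `sample_used_mb()` in a loop (e.g. once per second) while the AOT compilation runs in another process yields the kind of memory-over-time curve shown in the graph.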

(figure: RAM consumption graph during consecutive AOT compilations)

The graph indicates a clear pattern of memory consumption during the AOT compilation of these two assemblies. The memory consumption for compiling runningmono.dll is lower, as it is a less complex assembly, but the pattern is the same. However, there are two noticeable spikes that are worth investigating.

runningmono.dll:
JIT time: 6518 ms
Generation time: 47876 ms
Assembly+Link time: 127 ms

System.Private.CoreLib.dll:
JIT time: 53617 ms
Generation time: 388484 ms
Assembly+Link time: 2845 ms

The first spike, which is likely related to the JIT phase, does not have a constant slope. This irregularity may indicate a problem with memory handling during that phase. The second spike likely corresponds to the generation phase, and it appears to be consistent. By addressing these spikes, we aim to reduce the AOT compiler's RAM usage and improve its efficiency.

Tasks

  • Correlate spikes with the code to identify any memory-related problems
  • Use valgrind to search for any potential memory leaks

/cc: @vargaz @lambdageek @BrzVlad @ivanpovazan @fanyang-mono @matouskozak @SamMonoRT @steveisok

@kotlarmilos kotlarmilos added this to the 9.0.0 milestone Dec 8, 2023
@kotlarmilos kotlarmilos self-assigned this Dec 8, 2023
@steveisok
Member

@lateralusX when you're ready, can you share your findings with @kotlarmilos ?

@vargaz
Contributor

vargaz commented Dec 8, 2023

Memory usage in the AOT compiler consists of:

  • memory allocated by the JIT during compilation. Most of this is freed, but some might remain.
  • in LLVM mode, the LLVM bitcode module itself, which is built in memory before being saved to disk. This is also freed, but the memory is not returned to the OS.
  • the memory used by opt/llc during compilation. Since these are executed by the AOT compiler, the AOT compiler's memory usage is added on top of this.

Some ideas for improvements:

  • running a small number of optimization passes on LLVM functions just after they are emitted, e.g. InstCombine, might reduce the size of the initial LLVM module so it takes up less memory.
  • running llc/opt outside the AOT compiler, e.g. from msbuild or from a driver process, would make sure the memory usage doesn't add up.

@lateralusX
Member

lateralusX commented Dec 8, 2023

Been looking at this for a couple of days:

  • There are some smaller memory leaks around the cross compiler we could fix. They are not substantial, but I will fix them since I found them.
  • We lack intrinsic support for Vector256/Vector512, and we don't run a dead code elimination pass to remove the many branches that should be removed. Since .NET 8 uses intrinsics much more heavily in S.P.C, this bloats methods: for example, SpanHelpers::NonPackedIndexOfAnyValueType ends up with an 8 MB CFG on the Mono side, and since we don't run dead code elimination on the Mono side, all of that goes into the LLVM module. In the end, the cross compiler consumes gigabytes of memory compiling S.P.C on .NET 8.
  • Since we can't run MONO_OPT_BRANCH with LLVM, we would need to run LLVM function passes after emitting each method into LLVM, to reduce as much code as possible as early as possible.
  • I have also been thinking about removing the dependency between the mono cross compiler and opt/llc, so we can get rid of the LLVM module from memory in the cross compiler while running the other tools. An alternative is running opt and llc directly in memory, since we already have the LLVM module loaded and available; I did something similar in a previous POC, where we ran opt directly in the cross compiler.

I will fix the things I hit so far.

@tannergooding
Member

> Big issue is that we lack intrinsic support for new Vector256/Vector512 and Avx512 meaning isSupported and isHardwareAccelerated is not exchanged to intrinsic

Could you help me to understand this issue? The APIs in question are recursive and so Mono "must" have existing handling for them, even if just to cause them to return false (otherwise the code would simply not work and would overflow the stack). Because this handling must exist and can only return constant true or false today, it should be possible to eliminate it as dead code and so none of these code paths should be impacting the compilation for Mono. Even if this elimination happens as part of LLVM, I would expect LLVM to itself do the dead code elimination and simply not process these blocks at all.

Is the issue that LLVM is doing other optimizations first, rather than an early elimination of dead code blocks?

-- In RyuJIT we handle the general mapping to constant true/false here:

There is then a fallback for other special APIs, which may be unrecognized here:

RyuJIT then likewise does an early removal of dead code blocks as part of importation, to improve throughput for later phases.
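The early dead-block removal described above can be illustrated with a toy CFG sketch: once an `IsSupported` check folds to a constant, the branch keeps only its taken edge, and any block reachable only through the untaken edge drops out. The names and data structures here are illustrative, not Mono's or RyuJIT's actual internals.

```python
def prune_dead_blocks(cfg, entry, folded_branches):
    """cfg: block name -> list of successor block names.
    folded_branches: block -> the single taken successor, for branches whose
    condition folded to a constant (e.g. Vector512.IsSupported == false).
    Returns the set of blocks still reachable from entry."""
    reachable, worklist = set(), [entry]
    while worklist:
        block = worklist.pop()
        if block in reachable:
            continue
        reachable.add(block)
        # A folded branch keeps only its taken edge; others keep all edges.
        if block in folded_branches:
            worklist.append(folded_branches[block])
        else:
            worklist.extend(cfg.get(block, []))
    return reachable

# Toy method: entry branches on Vector512.IsSupported (folded to false),
# so the vectorized fast path is never reachable and need not be compiled.
cfg = {
    "entry": ["vector512_path", "scalar_path"],
    "vector512_path": ["exit"],
    "scalar_path": ["exit"],
    "exit": [],
}
live = prune_dead_blocks(cfg, "entry", {"entry": "scalar_path"})
```

With the fold applied, `vector512_path` never enters the reachable set, so none of its (potentially heavily inlined) code reaches the LLVM module.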

@lateralusX
Member

lateralusX commented Dec 11, 2023

Correct, there are fallbacks for the System.Runtime.Intrinsics.X86 classes' IsSupported methods to be false, as well as a fallback for System.Runtime.Intrinsics IsHardwareAccelerated to be false, so it will still "work" even if there is no specific implementation in the runtime for these intrinsics.

We do miss intrinsic support for Vector512<T>.IsSupported, so that ends up as a call, but it will be optimized away by LLVM opt later (we still generate the code, though). We also lack intrinsic support for X86.Avx512* and Vector256/Vector512, so those end up being inlined, and since most of them are marked always-inline, methods get bloated. I updated my initial comment with additional findings around the fallback logic implemented in Mono.

The main issue is that we don't run the regular JIT dead code elimination pass on the generated CFG when we use LLVM. In that case we rely on LLVM opt, but we only run that as an out-of-proc process after generating the full LLVM module (covering the whole assembly), so we end up generating all the code and consuming a lot of memory.

It would be great if we could do elimination of dead code blocks early in Mono as well; that would reduce the memory footprint when compiling assemblies like S.P.C.

I'm still investigating this issue, so I will share more details as I start to eliminate the initially detected issues.

@lateralusX
Member

lateralusX commented Dec 14, 2023

I have been using an unlinked S.P.C while investigating potential fixes that could have a significant impact on cross compiler memory usage.

Before any changes, running a full AOT compile of S.P.C targeting x64 ended up with a cross compiler memory usage of ~6 GB on .NET 8. I have investigated and implemented the following fixes, which dramatically reduce memory usage in this scenario targeting x64:

  • Fixed a number of memory leaks in compile_method, adding up to megabytes of leaked memory on large assemblies.
  • Added logic to run a couple of selected LLVM function passes per LLVM function. These passes simplify each LLVM function and eliminate dead code.
  • Added explicit IsSupported and IsHardwareAccelerated handling for unsupported x86 intrinsics to simd-intrinsics.c. There are fallbacks that already handle this (otherwise we would hit recursion errors), but this makes things more explicit in code, making it easier to detect areas where we are lacking and could improve going forward.
  • Added ILLink substitutions for System.Runtime.Intrinsics APIs not supported on Mono. This reduces the amount of dead code in S.P.C at build time, since ILLink can now do a much better job eliminating dead code, so bloated methods won't even hit the cross compiler.
  • Added PNSE implementations for x86 intrinsics not supported on Mono.
  • Added logic to ignore the aggressive inlining flag for intrinsics not supported by Mono in the cross compiler. These methods can still be inlined, but they follow the regular inline cost heuristics, so they won't trigger the excessive inlining identified in some methods. Right now the Vector256<T> and Vector512<T> types are handled this way, since they are not supported on Mono and will most likely end up as dead code anyway in the cases where they trigger excessive inlining.
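The last bullet's inlining change can be sketched as a toy decision function: the aggressive-inlining flag normally forces inlining, but for intrinsic types the cross compiler does not accelerate, the decision falls back to a plain size heuristic. The type names, cost threshold, and method representation are all illustrative, not Mono's actual inliner.

```python
def should_inline(method, cost_threshold=40):
    """Toy inline decision. `method` is a dict with keys:
    declaring_type, aggressive_inline, il_size (all illustrative)."""
    # Types the (hypothetical) cross compiler lacks intrinsic support for.
    unsupported_types = ("Vector256`1", "Vector512`1")
    unsupported = method["declaring_type"] in unsupported_types
    # AggressiveInlining wins only for supported types; otherwise the
    # callee competes under the regular size/cost heuristic.
    if method["aggressive_inline"] and not unsupported:
        return True
    return method["il_size"] <= cost_threshold

# A large always-inline Vector512 helper no longer forces inlining...
big_vec = {"declaring_type": "Vector512`1", "aggressive_inline": True, "il_size": 120}
# ...while the same-sized helper on a supported type still does.
big_ok = {"declaring_type": "Span`1", "aggressive_inline": True, "il_size": 120}
```

This keeps small unsupported helpers inlinable while preventing one bloated always-inline callee from ballooning every caller's CFG.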

With the above changes, compiling the unlinked S.P.C ends up at ~1.3 GB of memory usage, a rather dramatic improvement over the original ~6 GB.

I will also look into implementing a driver option in the cross compiler as part of this effort. I will probably add a new driver option to the AOT compiler that in turn runs the AOT compiler as a separate process using asm-only + no_opt, and then runs opt, llc, and the assembler as separate processes. That will make sure we release the memory used by the cross compiler before running the LLVM tooling (opt/llc), which should improve scalability and parallelization on build machines.
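The driver idea above amounts to sequencing each phase as its own child process, so the multi-gigabyte cross compiler image exits before opt/llc start. A minimal orchestration sketch follows; the tool names, flags, and file extensions are illustrative placeholders, not Mono's actual CLI.

```python
import subprocess

def run_aot_pipeline(assembly, dry_run=True):
    """Sketch of a driver-style AOT pipeline. Each phase runs as a separate
    process, so its memory is returned to the OS when it exits, instead of
    accumulating inside one long-lived cross compiler process.
    All commands below are hypothetical stand-ins."""
    bc, opt_bc, asm, obj = (assembly + ext for ext in (".bc", ".opt.bc", ".s", ".o"))
    phases = [
        ["mono-aot-cross", "--aot=asm-only,no_opt", assembly],  # emit LLVM bitcode, then exit
        ["opt", "-O2", bc, "-o", opt_bc],                       # optimize the module
        ["llc", opt_bc, "-o", asm],                             # lower to native assembly
        ["as", asm, "-o", obj],                                 # assemble the object file
    ]
    for cmd in phases:
        if dry_run:
            print(" ".join(cmd))  # show the pipeline without running tools
        else:
            subprocess.check_call(cmd)  # phase memory is released on process exit
    return obj
```

Run in dry-run mode, `run_aot_pipeline("S.P.C.dll")` just prints the four commands; a real driver would also want per-phase error handling and the option to run independent assemblies' pipelines in parallel.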

Still working on the fixes in a downstream repo, so I will complete the work there first and then upstream the relevant changes.

tannergooding added a commit to tannergooding/runtime that referenced this issue Jan 12, 2024
vargaz pushed a commit that referenced this issue Jan 12, 2024
…esolved (#96875)

* Exclude System.Numerics.Tensors.Tests from wasm aot until #95791 is resolved

* Also exclude System.Numerics.Tensors.Net8.Tests
@lateralusX
Member

PR reducing the memory footprint: #97096. Will add driver mode to the AOT compiler in a separate PR.

@lateralusX
Member

lateralusX commented Jan 19, 2024

PR implementing driver mode in the Mono AOT cross compiler: #97226. It reduces machine memory usage per compiled assembly by not keeping the cross compiler instance alive while running tools like opt and llc.

tmds pushed a commit to tmds/runtime that referenced this issue Jan 23, 2024
…is resolved (dotnet#96875)

* Exclude System.Numerics.Tensors.Tests from wasm aot until dotnet#95791 is resolved

* Also exclude System.Numerics.Tensors.Net8.Tests
pavelsavara added a commit to pavelsavara/runtime that referenced this issue Apr 18, 2024
pavelsavara added a commit that referenced this issue Apr 18, 2024
matouskozak pushed a commit to matouskozak/runtime that referenced this issue Apr 30, 2024
@github-actions github-actions bot locked and limited conversation to collaborators May 16, 2024