
[mono][aot] Investigate RAM consumption in Mono AOT compiler #95791

Closed
1 of 2 tasks
Tracked by #90427
kotlarmilos opened this issue Dec 8, 2023 · 8 comments
@kotlarmilos
Member

kotlarmilos commented Dec 8, 2023

Motivation and Background

The Mono AOT compiler currently requires a minimum of 16 GB of RAM to compile System.Private.CoreLib.dll on a Linux machine. This limitation is preventing us from running full AOT tests in our CI. The purpose of this issue is to explore ways to reduce the AOT compiler's RAM consumption.

Analysis

We conducted a simple experiment with the free -m command inside a Docker container using the cbl-mariner-2.0-cross-amd64 image on a machine with 32 GB of RAM. The graph below presents RAM consumption during consecutive AOT compilations of the runningmono.dll and System.Private.CoreLib.dll assemblies using the LLVM configuration in release mode.
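An experiment like this can be reproduced with a small sampling helper. The sketch below separates parsing from sampling so the parsing is portable; the column layout of `free -m` output is an assumption about the standard procps format, and the sampler itself only works on Linux with `free` installed.

```python
import subprocess

# Canned `free -m` output used to illustrate the expected column layout.
FREE_SAMPLE = """\
              total        used        free      shared  buff/cache   available
Mem:          31996       14210        1023         512       16762       16900
Swap:          2048           0        2048
"""

def parse_used_mb(free_output):
    """Extract the 'used' column (MB) of the Mem: row from `free -m` output."""
    for line in free_output.splitlines():
        if line.startswith("Mem:"):
            return int(line.split()[2])
    raise ValueError("no Mem: row found")

def sample_used_mb():
    """Sample current used memory in MB (requires Linux and procps `free`)."""
    return parse_used_mb(subprocess.check_output(["free", "-m"], text=True))
```

Calling `sample_used_mb()` in a loop (e.g. once per second) while the AOT compilation runs in another process yields the kind of memory-over-time curve shown in the graph.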

(figure: RAM consumption graph during consecutive AOT compilations)

The graph indicates a clear pattern of memory consumption during the AOT compilation of these two assemblies. The memory consumption for compiling runningmono.dll is lower, as it is a less complex assembly, but the pattern is the same. However, there are two noticeable spikes that are worth investigating.

runningmono.dll:
JIT time: 6518 ms
Generation time: 47876 ms
Assembly+Link time: 127 ms

System.Private.CoreLib.dll:
JIT time: 53617 ms
Generation time: 388484 ms
Assembly+Link time: 2845 ms

The first spike, which is likely related to the JIT phase, does not have a constant slope. This irregularity may indicate a problem with memory handling during that phase. The second spike likely corresponds to the generation phase, and it appears to be consistent. By addressing these spikes, we aim to reduce the AOT compiler's RAM usage and improve its efficiency.

Tasks

  • Correlate spikes with the code to identify any memory-related problems
  • Use valgrind to search for any potential memory leaks

/cc: @vargaz @lambdageek @BrzVlad @ivanpovazan @fanyang-mono @matouskozak @SamMonoRT @steveisok

@kotlarmilos kotlarmilos added this to the 9.0.0 milestone Dec 8, 2023
@kotlarmilos kotlarmilos self-assigned this Dec 8, 2023
@steveisok
Member

@lateralusX when you're ready, can you share your findings with @kotlarmilos ?

@vargaz
Contributor

vargaz commented Dec 8, 2023

Memory usage in the AOT compiler consists of:

  • memory allocated by the JIT during compilation. Most of this is freed, but some might remain.
  • in LLVM mode, the LLVM bitcode module itself, which is built in memory before being saved to disk. This is also freed, but the memory is not returned to the OS.
  • the memory used by opt/llc during compilation. Since these are executed by the AOT compiler, the AOT compiler's memory usage is added on top of this.

Some ideas for improvements:

  • running a small number of optimization passes on LLVM functions just after they are emitted, e.g. InstCombine, might reduce the size of the initial LLVM module so it takes up less memory.
  • running llc/opt outside the AOT compiler, e.g. from msbuild or from a driver process, would make sure the memory usage doesn't add up.

@lateralusX
Member

lateralusX commented Dec 8, 2023

Been looking at this for a couple of days:

  • There are some smaller memory leaks around the cross compiler we could fix. They are not substantial, but I will fix them since I found them.
  • We lack intrinsic support for Vector256/Vector512, and we don't run a dead code elimination pass to remove the many branches that should be removed. Since .NET 8 uses intrinsics much more heavily in S.P.C, this bloats methods: for example, SpanHelpers::NonPackedIndexOfAnyValueType ends up with an 8 MB CFG on the Mono side, and since we don't run dead code elimination on the Mono side, all of that goes into the LLVM module. In the end, the cross compiler consumes gigabytes of memory compiling S.P.C on .NET 8.
  • Since we can't run MONO_OPT_BRANCH with LLVM, we would need to run LLVM function passes after emitting each method into LLVM, to reduce as much code as possible as early as possible.
  • I have also been thinking about removing the dependency between the mono cross compiler and opt/llc, so we can get rid of the LLVM module from memory in the cross compiler while running the other tools. An alternative is running opt and llc directly in memory, since we already have the LLVM module loaded and available; I did something similar in a previous POC, where we ran opt directly in the cross compiler.

I will fix the things I hit so far.

@tannergooding
Member

> Big issue is that we lack intrinsic support for new Vector256/Vector512 and Avx512 meaning isSupported and isHardwareAccelerated is not exchanged to intrinsic

Could you help me to understand this issue? The APIs in question are recursive and so Mono "must" have existing handling for them, even if just to cause them to return false (otherwise the code would simply not work and would overflow the stack). Because this handling must exist and can only return constant true or false today, it should be possible to eliminate it as dead code and so none of these code paths should be impacting the compilation for Mono. Even if this elimination happens as part of LLVM, I would expect LLVM to itself do the dead code elimination and simply not process these blocks at all.

Is the issue that LLVM is doing other optimizations first, rather than an early elimination of dead code blocks?

-- In RyuJIT we handle the general mapping to constant true/false here:

There is then a fallback for other special APIs, which may be unrecognized here:

RyuJIT then likewise does an early removal of dead code blocks as part of importation, to improve throughput for later phases.
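The early dead-block removal described above can be illustrated with a toy CFG sketch: once an `IsSupported` check folds to a constant, the branch keeps only its taken edge, and any block reachable only through the untaken edge drops out. The names and data structures here are illustrative, not Mono's or RyuJIT's actual internals.

```python
def prune_dead_blocks(cfg, entry, folded_branches):
    """cfg: block name -> list of successor block names.
    folded_branches: block -> the single taken successor, for branches whose
    condition folded to a constant (e.g. Vector512.IsSupported == false).
    Returns the set of blocks still reachable from entry."""
    reachable, worklist = set(), [entry]
    while worklist:
        block = worklist.pop()
        if block in reachable:
            continue
        reachable.add(block)
        # A folded branch keeps only its taken edge; others keep all edges.
        if block in folded_branches:
            worklist.append(folded_branches[block])
        else:
            worklist.extend(cfg.get(block, []))
    return reachable

# Toy method: entry branches on Vector512.IsSupported (folded to false),
# so the vectorized fast path is never reachable and need not be compiled.
cfg = {
    "entry": ["vector512_path", "scalar_path"],
    "vector512_path": ["exit"],
    "scalar_path": ["exit"],
    "exit": [],
}
live = prune_dead_blocks(cfg, "entry", {"entry": "scalar_path"})
```

With the fold applied, `vector512_path` never enters the reachable set, so none of its (potentially heavily inlined) code reaches the LLVM module.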

@lateralusX
Member

lateralusX commented Dec 11, 2023

Correct, there are fallbacks for the System.Runtime.Intrinsics.X86 classes' IsSupported methods to be false, as well as a fallback for System.Runtime.Intrinsics IsHardwareAccelerated to be false, so it will still "work" even if there is no specific implementation in the runtime for these intrinsics.

We do miss intrinsic support for Vector512<T>.IsSupported, so that ends up as a call, but it will be optimized away by LLVM opt later (we still generate the code, though). We also lack intrinsic support for X86.Avx512* and Vector256/Vector512, so those end up being inlined, and since most of them are marked always-inline, methods get bloated. I updated my initial comment with additional findings around the fallback logic implemented in Mono.

The main issue is that we don't run the regular JIT dead code elimination pass on the generated CFG when we use LLVM. In that case we rely on LLVM opt, but we only run that as an out-of-proc process after generating the full LLVM module (covering the whole assembly), so we end up generating all the code and consuming a lot of memory.

It would be great if we could do elimination of dead code blocks early in Mono as well; that would reduce the memory footprint when compiling assemblies like S.P.C.

I'm still investigating this issue, so I will share more details as I start to eliminate the initially detected issues.

@lateralusX
Member

lateralusX commented Dec 14, 2023

I have been using an unlinked S.P.C while investigating potential fixes that could have a significant impact on cross compiler memory usage.

Before any changes, running a full AOT compile of S.P.C targeting x64 ended up with a cross compiler memory usage of ~6 GB on .NET 8. I have investigated and implemented the following fixes, which dramatically reduce memory usage in this scenario targeting x64:

  • Fixed a number of memory leaks in compile_method, adding up to megabytes of leaked memory on large assemblies.
  • Added logic to run a couple of selected LLVM function passes per LLVM function. These passes simplify each LLVM function and eliminate dead code.
  • Added explicit IsSupported and IsHardwareAccelerated handling for unsupported x86 intrinsics to simd-intrinsics.c. There are fallbacks that already handle this (otherwise we would hit recursion errors), but this makes things more explicit in code, making it easier to detect areas where we are lacking and could improve going forward.
  • Added ILLink substitutions for System.Runtime.Intrinsics APIs not supported on Mono. This reduces the amount of dead code in S.P.C at build time, since ILLink can now do a much better job eliminating dead code, so bloated methods won't even hit the cross compiler.
  • Added PNSE implementations for x86 intrinsics not supported on Mono.
  • Added logic to ignore the aggressive inlining flag for intrinsics not supported by Mono in the cross compiler. These methods can still be inlined, but they follow the regular inline cost heuristics, so they won't trigger the excessive inlining identified in some methods. Right now the Vector256<T> and Vector512<T> types are handled this way, since they are not supported on Mono and will most likely end up as dead code anyway in the cases where they trigger excessive inlining.
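The last bullet's inlining change can be sketched as a toy decision function: the aggressive-inlining flag normally forces inlining, but for intrinsic types the cross compiler does not accelerate, the decision falls back to a plain size heuristic. The type names, cost threshold, and method representation are all illustrative, not Mono's actual inliner.

```python
def should_inline(method, cost_threshold=40):
    """Toy inline decision. `method` is a dict with keys:
    declaring_type, aggressive_inline, il_size (all illustrative)."""
    # Types the (hypothetical) cross compiler lacks intrinsic support for.
    unsupported_types = ("Vector256`1", "Vector512`1")
    unsupported = method["declaring_type"] in unsupported_types
    # AggressiveInlining wins only for supported types; otherwise the
    # callee competes under the regular size/cost heuristic.
    if method["aggressive_inline"] and not unsupported:
        return True
    return method["il_size"] <= cost_threshold

# A large always-inline Vector512 helper no longer forces inlining...
big_vec = {"declaring_type": "Vector512`1", "aggressive_inline": True, "il_size": 120}
# ...while the same-sized helper on a supported type still does.
big_ok = {"declaring_type": "Span`1", "aggressive_inline": True, "il_size": 120}
```

This keeps small unsupported helpers inlinable while preventing one bloated always-inline callee from ballooning every caller's CFG.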

With the above changes, compiling the unlinked S.P.C ends up at ~1.3 GB of memory usage, a rather dramatic improvement over the original ~6 GB.

I will also look into implementing a driver option in the cross compiler as part of this effort. I will probably add a new driver option to the AOT compiler that in turn runs the AOT compiler as a separate process using asm-only + no_opt, and then runs opt, llc, and the assembler as separate processes. That will make sure we release the memory used by the cross compiler before running the LLVM tooling (opt/llc), which should improve scalability and parallelization on build machines.
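The driver idea above amounts to sequencing each phase as its own child process, so the multi-gigabyte cross compiler image exits before opt/llc start. A minimal orchestration sketch follows; the tool names, flags, and file extensions are illustrative placeholders, not Mono's actual CLI.

```python
import subprocess

def run_aot_pipeline(assembly, dry_run=True):
    """Sketch of a driver-style AOT pipeline. Each phase runs as a separate
    process, so its memory is returned to the OS when it exits, instead of
    accumulating inside one long-lived cross compiler process.
    All commands below are hypothetical stand-ins."""
    bc, opt_bc, asm, obj = (assembly + ext for ext in (".bc", ".opt.bc", ".s", ".o"))
    phases = [
        ["mono-aot-cross", "--aot=asm-only,no_opt", assembly],  # emit LLVM bitcode, then exit
        ["opt", "-O2", bc, "-o", opt_bc],                       # optimize the module
        ["llc", opt_bc, "-o", asm],                             # lower to native assembly
        ["as", asm, "-o", obj],                                 # assemble the object file
    ]
    for cmd in phases:
        if dry_run:
            print(" ".join(cmd))  # show the pipeline without running tools
        else:
            subprocess.check_call(cmd)  # phase memory is released on process exit
    return obj
```

Run in dry-run mode, `run_aot_pipeline("S.P.C.dll")` just prints the four commands; a real driver would also want per-phase error handling and the option to run independent assemblies' pipelines in parallel.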

Still working on the fixes in a downstream repo, so I will complete the work there first and then upstream the relevant changes.

tannergooding added a commit to tannergooding/runtime that referenced this issue Jan 12, 2024
vargaz pushed a commit that referenced this issue Jan 12, 2024
…esolved (#96875)

* Exclude System.Numerics.Tensors.Tests from wasm aot until #95791 is resolved

* Also exclude System.Numerics.Tensors.Net8.Tests
@lateralusX
Member

PR reducing the memory footprint: #97096. Will add driver mode to the AOT compiler in a separate PR.

@lateralusX
Member

lateralusX commented Jan 19, 2024

PR implementing driver mode in the Mono AOT cross compiler: #97226. It reduces machine memory usage per compiled assembly by not keeping the cross compiler instance alive while running tools like opt and llc.

tmds pushed a commit to tmds/runtime that referenced this issue Jan 23, 2024
…is resolved (dotnet#96875)

* Exclude System.Numerics.Tensors.Tests from wasm aot until dotnet#95791 is resolved

* Also exclude System.Numerics.Tensors.Net8.Tests
pavelsavara added a commit to pavelsavara/runtime that referenced this issue Apr 18, 2024
pavelsavara added a commit that referenced this issue Apr 18, 2024
matouskozak pushed a commit to matouskozak/runtime that referenced this issue Apr 30, 2024
@github-actions github-actions bot locked and limited conversation to collaborators May 16, 2024