Generate more efficient ARM64 prologs/epilogs #88823

filipnavara · 2023-07-13T09:48:17Z

During the investigation of #88292 I found that NativeAOT/ARM64 and R2R never generate frameless methods. In a typical app, more than 30% of methods end up with a simple frame prolog/epilog that saves no callee-saved registers and allocates no extra stack space. Most of these are very likely leaf methods, which could be frameless.

For example, take this simple method:

int Square(int num) { return num * num; }

NativeAOT generates the following code:

Program:Square(int):int (FullOpts):
            stp     fp, lr, [sp, #-0x10]!
            mov     fp, sp
            mul     w0, w0, w0
            ldp     fp, lr, [sp], #0x10
            ret     lr

An optimizing C compiler (clang -O) generates:

square: // @square
  mul w0, w0, w0
  ret

Not only is the code significantly smaller, but the frameless version also saves a lot of space in the unwinding information.

ghost · 2023-07-13T09:48:23Z

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

Author:	filipnavara
Assignees:	-
Labels:	`area-CodeGen-coreclr`
Milestone:	-

jakobbotsch · 2023-07-13T09:52:24Z

Dup of #35274?

filipnavara · 2023-07-13T09:52:47Z

An orthogonal issue would be to optionally generate prologs that are compatible with Apple Compact Unwinding:

The frame would need to follow a structure where x29/x30 (frame pointer, link register) are saved at the top, and the x29 register points to the bottom of this chained structure (that is, at the saved x29/x30 pair).
Callee-saved registers are always saved in pairs, even if an odd number of them is used. At first glance this may look like a waste, but it's not: the stp instruction can save two registers at a time, and the stack always has to be 16-byte aligned anyway.
Callee-saved registers need to be saved in a certain order.
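
To make the pairing rule concrete, here is a minimal sketch of the frame-size arithmetic it implies. The helper name `compact_frame_bytes` is hypothetical and is not part of the JIT; it assumes every saved register occupies 8 bytes and that the frame holds nothing besides the x29/x30 pair and the callee-saved registers (no locals or outgoing arguments).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not a JIT API): bytes of stack used by a frame that
 * stores x29/x30 plus `callee_saved_regs` additional 8-byte registers,
 * always saved in pairs. Assumes no locals or outgoing argument space. */
size_t compact_frame_bytes(unsigned callee_saved_regs)
{
    /* Round an odd register count up to the next even count, so every
     * register can be stored by an stp of two registers. */
    unsigned paired = (callee_saved_regs + 1u) & ~1u;

    /* 16 bytes for the x29/x30 pair, 8 bytes per saved register. */
    size_t bytes = 16u + (size_t)paired * 8u;

    /* A whole number of 16-byte pairs is automatically 16-byte aligned. */
    assert(bytes % 16u == 0u);
    return bytes;
}
```

Under these assumptions, five and six callee-saved registers both give 0x40 bytes, so the padding slot that alignment would require anyway is simply used to store one more register.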

An example of an Apple-compatible prolog:

stp x24, x23, [sp, #-0x40]!
stp x22, x21, [sp, #0x10]
stp x20, x19, [sp, #0x20]
stp x29, x30, [sp, #0x30]
add x29, sp, #0x30 ; x29/fp points at the saved x29/x30 pair

The actual instructions and their order are not important; only the resulting frame layout is.
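
As an illustration of that layout, the frame produced by the example prolog can be written down as a C struct, lowest address (sp) first. This is only a sketch for reasoning about offsets: the struct and the two helper functions are hypothetical names, not anything the runtime defines.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative only: memory image of the example frame, starting at sp.
 * `stp a, b, [addr]` stores a at addr and b at addr + 8. */
struct example_frame {
    uint64_t x24, x23;   /* [sp, #0x00], written by the pre-indexed stp */
    uint64_t x22, x21;   /* [sp, #0x10] */
    uint64_t x20, x19;   /* [sp, #0x20] */
    uint64_t saved_x29;  /* [sp, #0x30], the new x29 points here */
    uint64_t saved_x30;  /* [sp, #0x38] */
};

/* Offset from sp that `add x29, sp, #0x30` yields. */
size_t frame_pointer_offset(void)
{
    return offsetof(struct example_frame, saved_x29);
}

/* Total stack adjustment made by `stp x24, x23, [sp, #-0x40]!`. */
size_t frame_size(void)
{
    return sizeof(struct example_frame);
}
```

The saved x29/x30 pair sits at the highest addresses of the frame, and the callee-saved pairs lie directly below the address that x29 holds.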

Cursory observation of JIT output shows that:

For certain frame types the x29/x30 registers are saved at the top (the correct location), but the new frame pointer (x29) is then set to point to the bottom of the frame, below the callee-saved registers.
An odd number of callee-saved registers results in a sequence that aligns the stack to a 16-byte boundary, but it doesn't use the pattern that saves an extra register.
The order for non-FP registers seems to be fine. FP registers, and combinations of FP and non-FP registers, would still need to be checked.

filipnavara · 2023-07-13T09:56:26Z

> Dup of #35274?

I suppose it is a duplicate in a way, although I am specifically focusing on NativeAOT here, which has slight differences in the GC suspension architecture. Also, the numbers in that issue don't correspond at all to my observations, and they don't take into account the size of the unwinding information emitted in the NativeAOT case.

EgorBo · 2023-07-13T13:43:09Z

Inline all the small methods! 🙂

JulieLeeMSFT · 2023-07-21T19:05:55Z

Moving to Future because we are past the .NET 8 Preview 7 code-complete due date.

filipnavara added the area-CodeGen-coreclr label (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI) on Jul 13, 2023

ghost added the untriaged label (New issue has not been triaged by the area owner) on Jul 13, 2023

jkotas added the os-mac-os-x label (macOS aka OSX) on Jul 16, 2023

JulieLeeMSFT added this to the Future milestone on Jul 21, 2023

ghost removed the untriaged label (New issue has not been triaged by the area owner) on Jul 21, 2023

EgorBo mentioned this issue Jul 30, 2024

Quality of native perf profiling on x64 #105690

Open
