Generate more efficient ARM64 prologs/epilogs #88823

Open
filipnavara opened this issue Jul 13, 2023 · 6 comments
Labels: area-CodeGen-coreclr (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI), os-mac-os-x (macOS aka OSX)

@filipnavara
Member

During the investigation of #88292 I found that NativeAOT/ARM64 and R2R never generate frameless methods. In a typical app, more than 30% of methods end up with a simple frame prolog/epilog that saves no callee-saved registers and allocates no extra stack space. Most of these are very likely leaf methods, which could be frameless.

For example, take this simple method:

int Square(int num) { return num * num; }

NativeAOT generates the following code:

Program:Square(int):int (FullOpts):
            stp     fp, lr, [sp, #-0x10]!
            mov     fp, sp
            mul     w0, w0, w0
            ldp     fp, lr, [sp], #0x10
            ret     lr

An optimizing C compiler (clang -O) generates:

square: // @square
  mul w0, w0, w0
  ret

Not only is the code significantly smaller, but the frameless version also saves a lot of space in the unwinding information.
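To put a rough number on the code-size difference, here is a minimal sketch of the arithmetic for the two listings above. It relies only on the fact that every A64 instruction is a fixed 4 bytes; the helper name `code_size` is made up for illustration, and the size of the unwinding information is not modeled.

```python
# Every A64 instruction is encoded as one fixed-width 32-bit word.
A64_INSTRUCTION_BYTES = 4


def code_size(instruction_count: int) -> int:
    """Return the code size in bytes for a given number of A64 instructions."""
    return instruction_count * A64_INSTRUCTION_BYTES


framed = code_size(5)     # stp, mov, mul, ldp, ret (NativeAOT listing)
frameless = code_size(2)  # mul, ret (clang listing)

print(f"framed: {framed} bytes, frameless: {frameless} bytes, "
      f"saved: {framed - frameless} bytes")
```

For this method the frame setup and teardown account for 12 of the 20 bytes.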

@filipnavara filipnavara added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Jul 13, 2023
@ghost ghost added the untriaged New issue has not been triaged by the area owner label Jul 13, 2023
@ghost

ghost commented Jul 13, 2023

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.


@jakobbotsch
Member

Dup of #35274?

@filipnavara
Member Author

An orthogonal issue would be to optionally generate prologs that are compatible with Apple Compact Unwinding:

  1. The frame would need to follow a structure where x29/x30 (frame pointer, link register) are saved at the top, and x29 points to the bottom of this chained structure.
  2. Callee-saved registers are always saved in pairs, even if an odd number of them is used. At first glance this may look like a waste, but it's not: the stp instruction can save two registers at a time, and the stack always has to be 16-byte aligned anyway.
  3. Callee-saved registers need to be saved in a certain order.

An example of an Apple-compatible prolog:

stp x24, x23, [sp, #-0x40]!
stp x22, x21, [sp, #0x10]
stp x20, x19, [sp, #0x20]
stp x29, x30, [sp, #0x30]
add x29, sp, #0x30 ; x29/fp points just below the saved chain

The actual instructions and their order are not important; what matters is the resulting frame layout.
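The layout rules above can be sketched as a small calculation. This is only an illustration under stated assumptions: the helper `apple_style_frame` is hypothetical, it considers integer callee-saved registers only, and it ignores locals and outgoing argument space. For six registers (x19 to x24) it reproduces the offsets used in the example prolog.

```python
def apple_style_frame(callee_saved: int) -> dict:
    """Sketch of a frame layout for `callee_saved` integer callee-saved registers.

    Hypothetical helper: assumes no locals, no outgoing arguments, no FP registers.
    """
    pairs = (callee_saved + 1) // 2  # an odd count is rounded up to a full pair
    saved_area = 16 * pairs          # each stp stores two 8-byte registers
    frame_size = saved_area + 16     # plus the x29/x30 record at the top
    return {
        "pairs": pairs,
        "frame_size": frame_size,
        "fp_offset": saved_area,     # x29 = sp + fp_offset, i.e. the x29/x30 record
    }


layout = apple_style_frame(6)  # x19..x24, as in the example prolog
print(hex(layout["frame_size"]), hex(layout["fp_offset"]))  # 0x40 0x30
```

Five registers yield the same 0x40-byte frame, which is why saving the extra register of a pair costs no additional stack space.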

Cursory observation of JIT output shows that:

  1. For certain frame types the x29/x30 registers are saved at the top (the correct location), but the new frame pointer (x29) is then set to point to the bottom of the frame, below the callee-saved registers.
  2. An odd number of callee-saved registers results in a sequence that aligns the stack to a 16-byte boundary, but it doesn't use the pattern that saves an extra register.
  3. The order for non-FP registers seems to be fine. FP registers, and combinations of FP and non-FP registers, would need to be checked.
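The first two observations can be expressed as a simple check on a frame description. The sketch below is hypothetical: `compact_unwind_issues` and its parameters are invented for illustration, it encodes only the rules stated in this comment, and it does not check register order or the real compact unwind encoding.

```python
def compact_unwind_issues(frame_size: int, record_offset: int,
                          fp_offset: int, saved_regs: int) -> list:
    """List the ways a frame deviates from the layout described above.

    All offsets are in bytes from sp after the prolog. Hypothetical helper;
    the order of the saved registers is not checked.
    """
    issues = []
    if record_offset != frame_size - 16:
        issues.append("x29/x30 are not saved at the top of the frame")
    if fp_offset != record_offset:
        issues.append("x29 does not point at the saved x29/x30 pair")
    if saved_regs % 2 != 0:
        issues.append("odd number of callee-saved registers stored")
    return issues


# The example prolog: 0x40-byte frame, x29/x30 at sp+0x30, x29 = sp+0x30.
print(compact_unwind_issues(0x40, 0x30, 0x30, 6))  # []

# Observation 1: x29/x30 saved at the top, but x29 set to the bottom of the frame.
print(compact_unwind_issues(0x40, 0x30, 0x00, 6))
```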

@filipnavara
Member Author

Dup of #35274?

I suppose it is a duplicate in a way, although I am specifically focusing on NativeAOT here which has slight differences in the GC suspension architecture. Also, the numbers in that issue don't correspond at all to my observations, and they don't take into account the size of unwinding information emitted in the NativeAOT case.

@EgorBo
Member

EgorBo commented Jul 13, 2023

Inline all the small methods! 🙂

@jkotas jkotas added the os-mac-os-x macOS aka OSX label Jul 16, 2023
@JulieLeeMSFT JulieLeeMSFT added this to the Future milestone Jul 21, 2023
@ghost ghost removed the untriaged New issue has not been triaged by the area owner label Jul 21, 2023
@JulieLeeMSFT
Member

Moving to Future because we are past .NET 8 Preview 7 code complete due date.
