[arm/arm64] Leaf frames, saving LR, and return address hijacking #35274

Open · Tracked by #77010
BruceForstall opened this issue Apr 22, 2020 · 10 comments
Labels: arch-arm64, area-CodeGen-coreclr, JitUntriaged, optimization
Milestone: Future

BruceForstall commented Apr 22, 2020

During the original development of the arm32 product, it was decided that the lr register would always be stored to the stack in the prolog to support return address hijacking for GC suspension. There is this comment in the JIT (CodeGen::genPushCalleeSavedRegisters()):

// It may be possible to skip pushing/popping lr for leaf methods. However, such an optimization would require
// changes in the GC suspension architecture.
//
// We would need to guarantee that a tight loop calling a virtual leaf method can be suspended for GC. Today, we
// generate partially interruptible code for both the method that contains the tight loop with the call and the leaf
// method. GC suspension depends on return address hijacking in this case. Return address hijacking depends
// on the return address to be saved on the stack. If we skipped pushing/popping lr, the return address would never
// be saved on the stack and the GC suspension would time out.
//
// So if we wanted to skip pushing/popping lr for leaf frames, we would also need to do one of
// the following to make GC suspension work in the above scenario:
// - Make return address hijacking work even when lr is not saved on the stack.
// - Generate fully interruptible code for loops that contain calls
// - Generate fully interruptible code for leaf methods
//
// Given the limited benefit from this optimization (<10k for mscorlib NGen image), the extra complexity
// is not worth it.

This decision was maintained when arm64 support was added.

Should this decision be reconsidered?

For arm64, empty methods have this minimum code size:

stp     fp, lr, [sp,#-16]!
mov     fp, sp
ldp     fp, lr, [sp],#16
ret     lr

Question: if function A contained a loop with a lot of expensive computation (say, 1000 divides) and a single call to a trivial function (which is fully interruptible?), then the expensive loop is only partially interruptible due to the call. But there will be only one instruction at the call that is a GC safe point (or maybe two?). Isn't GC starvation likely in this scenario?
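
For concreteness, a hypothetical C# shape of this scenario (names and bodies invented purely for illustration, assuming Leaf is not inlined) might look like:

using System.Runtime.CompilerServices;

static class Example
{
    // Trivial callee. If the JIT does not inline it, the call site in the
    // loop below is essentially the only GC safe point in the loop body,
    // because the loop itself is compiled as partially interruptible code.
    [MethodImpl(MethodImplOptions.NoInlining)]
    static int Leaf(int x) => x + 1;

    // "Function A": lots of call-free, expensive computation plus a single
    // call that GC suspension has to rely on for return address hijacking.
    static long A(int[] data)
    {
        long sum = 0;
        for (int i = 0; i < data.Length; i++)
        {
            sum += data[i] / 3 / 5 / 7; // stand-in for the expensive divides
            sum += Leaf(i);             // the lone call / hijack opportunity
        }
        return sum;
    }
}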

Even simple leaf methods require this prolog/epilog, e.g.:

G_M1350_IG01:
        A9BF7BFD          stp     fp, lr, [sp,#-16]!
        910003FD          mov     fp, sp

G_M1350_IG02:
        F9401400          ldr     x0, [x0,#40]

G_M1350_IG03:
        A8C17BFD          ldp     fp, lr, [sp],#16
        D65F03C0          ret     lr
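
For reference, a method of roughly the following (hypothetical) shape - a non-inlined instance field getter, with field names and layout invented for illustration - produces a prolog/epilog like the one above even though its body is a single load:

using System.Runtime.CompilerServices;

class Node
{
    // Earlier fields push _value out to a non-zero offset,
    // analogous to the [x0,#40] load in the disassembly above.
    private long _a, _b, _c, _d;
    private long _value;

    [MethodImpl(MethodImplOptions.NoInlining)]
    public long GetValue() => _value;
}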

One argument for keeping the current behavior is that simple leaf methods are likely to be inlined into their callers, and therefore this prolog/epilog overhead isn't encountered in real programs.

The overhead measurement (<10k in the above comment for mscorlib, now System.Private.CoreLib) could be recomputed for arm64 and the current situation, and extended to other libraries as well.

There might also be implications of not saving and establishing fp for debugging, stack walking, etc.

Comments? @jkotas @AndyAyersMS @kunalspathak

category:question
theme:prolog-epilog
skill-level:expert
cost:large

@BruceForstall BruceForstall added arch-arm64 area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI optimization labels Apr 22, 2020
@BruceForstall BruceForstall added this to the Future milestone Apr 22, 2020
@Dotnet-GitSync-Bot Dotnet-GitSync-Bot added the untriaged New issue has not been triaged by the area owner label Apr 22, 2020
@BruceForstall BruceForstall removed the untriaged New issue has not been triaged by the area owner label Apr 22, 2020
jkotas commented Apr 22, 2020

Even simple leaf methods require this prolog/epilog,

Does the simple method in your example really need to establish fp? fp should not be required for this (one less instruction).

if function A contained a loop with a lot of expensive computation (say, 1000 divides) and a single call to a trivial function (which is fully interruptible?), then the expensive loop is only partially interruptible due to the call. But there will be only one instruction at the call that is a GC safe point (or maybe two?). Isn't GC starvation likely in this scenario?

Yes, the current JIT heuristics around GC suspension really only guarantee that GC suspension has a chance of succeeding. They do not guarantee that the suspension happens in a timely manner. It is not hard to construct real-world-looking code that hits issues like this. It is something we should be working towards fixing, but I am not sure we would want to fix it by generating fully interruptible code for all loops. It needs data about the tradeoffs.

Should this decision be reconsidered?

Sounds fine to me. It needs data about the tradeoffs.

jkotas commented Apr 22, 2020

E.g. an interesting data point would be: if we stopped generating the frame for small functions (i.e. smaller code) and started generating fully interruptible code for all loops (i.e. bigger GCInfo), would the CoreLib native image be smaller or bigger overall?

BruceForstall commented

Does the simple method in your example really need to establish fp?

It seems like it doesn't need to. I think we establish it for Windows ETW, and for stack walking (maybe just gdb debugging) on Linux. I'm not sure if it's required for correctness.

E.g. an interesting data point would be...

I think the fully-versus-partially-interruptible choice is currently made on a per-function basis; we don't do it just for particular loops or code regions. The question is whether the GC info supports it on a region basis.

Assuming it is supported (or could be engineered), then gathering that data would be interesting.

Question: as I understand it, GC suspension is, very roughly:

while (true) {
  stop all threads
  if (all threads are at a GC safe point)
    break;
  if (we've tried this multiple times and haven't succeeded) // or maybe immediately?
    set up return address hijacks
  restart all threads
  wait some amount of time
}

How many "tries" (or loops) normally happen before we succeed? (And how costly is it?) Would changing our fully interruptible / call / loop strategy make GC suspension faster and more reliable?

kunalspathak commented

Just to share one data point: in the R2R-generated code of the framework libraries for x64, 23% of methods don't have a prolog/epilog. These methods could be optimized for arm64 as well.

jkotas commented Apr 22, 2020

The question is whether the GC info supports it on a region basis.

I believe that the GC Info encoding (the non-x86 version at least) does support it on a region basis. Look for DefineInterruptibleRange. It was used by JIT64.

Assuming it is supported (or could be engineered), then gathering that data would be interesting.

You can gather the data even with the setting applied to the whole method: it will give us an upper bound.

GC suspension is

Here is a more accurate version:

do {
  retry = false;
  foreach thread {
      if (thread in preemptive mode or stopped at GC safe point) continue;
      stop the thread
      if (in fully interruptible region) continue; // at GC safe point
      try to setup return address hijack
      resume thread
      retry = true;
  }
} while (retry);

How many "tries" (or loops) normally happen before we succeed?

The typical case is that we go through one iteration of the loop, and by the time we get to the second iteration all threads have hit a return address hijack or some other GC safe spot. Typical methods are small, so they hit the return address hijack quickly.

Fully interruptible code is important for long-running loops. It matters for predictable performance - the 98+% latencies, etc.

VSadov commented Jun 19, 2024

One interesting thought by @filipnavara is that partially interruptible leaf methods may not need to support external unhijacking: once hijacked in such a method, a thread will inevitably unhijack itself, either by hitting the hijack or by exiting the frame exceptionally (e.g. the NULL-dereference case).

There is never a need to move a hijack lower if it is already on a leaf call, and that could be a useful property to consider.

filipnavara commented Aug 29, 2024

@EgorBo and I made a quick prototype earlier this week. It implements the NativeAOT side of the LR-based hijacking and some trivial support in the JIT to recognize leaf methods and emit them as frameless. Notably it does NOT include the CoreCLR bits or any logic to handle R2R versioning. It was only tested on osx-arm64.

The purpose of the prototype was twofold. First, to show that the hijacking logic is relatively easy to implement (~100 lines of code). Second, to measure the size impact of the change on a typical executable.

The hijacking logic was tested on two use cases:

  • A simple non-inlined method (verified to be emitted as frameless) called from a loop, roughly the shape sketched below. Another thread was continually running GC.Collect(2). It was verified in the debugger that the frameless method was successfully hijacked and allowed the GC to proceed.
  • A more complex non-inlined frameless method that throws a HW exception (e.g. null dereference) and is called from a loop. This test used a slightly rigged JIT to generate a non-interruptible frameless method containing a loop (to aid the hijack timing). It was verified in the debugger that the exception is successfully processed within a hijacked leaf method, and that a parallel GC received a valid context for stack walking while the exception handling was taking place.
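
A minimal sketch of the first test case, with hypothetical names (the actual prototype test code is not shown in this thread):

using System;
using System.Runtime.CompilerServices;
using System.Threading;

static class HijackTest
{
    // Leaf candidate: no calls and no frame needed, so the prototype JIT
    // can emit it as a frameless method.
    [MethodImpl(MethodImplOptions.NoInlining)]
    static int Leaf(int x) => x * 3 + 1;

    static void Main()
    {
        // Another thread keeps forcing gen-2 collections...
        var gcThread = new Thread(() => { while (true) GC.Collect(2); })
        {
            IsBackground = true
        };
        gcThread.Start();

        // ...while this loop repeatedly calls the frameless leaf method, so
        // suspension must hijack the return address held in lr.
        long sum = 0;
        for (int i = 0; i < 1_000_000_000; i++)
            sum += Leaf(i);
        Console.WriteLine(sum);
    }
}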

Moreover, all NativeAOT smoke tests were run several times with the change and passed with no errors.

Now, for the size data. I only checked two trivial cases: nativeaot/SmokeTests/UnitTests and an empty .NET MAUI app. In both cases I measured the size change of the object file in the ILC output and the number of frameless entries in the compact unwinding table. The latter measurement is an imprecise approximation since it doesn't count methods that have no frame but do have the stack pointer adjusted, and there are at least a few of those present.

For the empty .NET MAUI app the object file shrank by 27,896 bytes. At least 2,167 methods, out of 33,785, are now frameless (6%). For UnitTests the object file shrank by 16,496 bytes. At least 1,389 methods, out of 10,511, are now frameless (13%).

The size savings would be slightly bigger on Linux/Windows due to a larger impact on the size of the unwinding data. On macOS and other Mach-O platforms we end up generating compact unwinding data, which should not amount to any size difference in this case.

I didn't measure any performance data. I expect that in most cases the methods that would end up as frameless leaf methods would be inlined anyway. There are at least some Vector<X> fallbacks becoming frameless, and that could possibly have some measurable impact in R2R or tiered compilation, but that was not tested in the prototype.

VSadov commented Sep 4, 2024

The prototype looks very convincing!

Notably it does NOT include the CoreCLR bits or any logic to handle R2R versioning. It was only tested on osx-arm64.

We are likely to end up bumping the minimum R2R version for 10.0 anyway for a few other reasons (e.g. doing #104336 in 10.0 would also require an R2R minimum version bump, and maybe some other changes too).
That would take care of the versioning. It is reasonable to not deal with R2R versioning in the prototype.

NativeAOT is more approachable for experimentation in this area, since hijacking is done in a more straightforward way. We will work towards making CoreCLR similar, but so far that work has been incremental and there are still differences that make similar experimentation with CoreCLR a bit more unwieldy.
However, for testing purposes in GC-root-reporting areas we rely a lot on JIT-based GC stress, so it would be interesting to see a CoreCLR counterpart of this and run it under GC stress. Sometimes that finds pretty gnarly, design-challenging scenarios that were not expected.

For the empty .NET MAUI app the object file shrank by 27,896 bytes. At least 2,167 methods, out of 33,785, are now frameless (6%). For UnitTests the object file shrank by 16,496 bytes. At least 1,389 methods, out of 10,511, are now frameless (13%).

Another example that often works as a good proxy for a "mid-size average app" is the System.Collections.Concurrent.Tests executable for NativeAOT. It would be interesting, if it is not too hard, to see the size impact on that.

VSadov commented Sep 4, 2024

I can help with the CoreCLR counterpart. The codemanager changes are fairly small, though, so perhaps porting that could be relatively straightforward.

filipnavara commented

Another example that often works as a good proxy for a "mid-size average app" is the System.Collections.Concurrent.Tests executable for NativeAOT. It would be interesting, if it is not too hard, to see the size impact on that.

I don't have the size data, but at least 9,860 methods out of 82,797 become frameless with the prototype in that test.

I have a different branch where I improved the compact unwind encoding and measured the frameless methods with an SP offset. In the MAUI sample it was a single-digit number (7 IIRC), so it's not really statistically significant.

I can help with the CoreCLR counterpart. The codemanager changes are fairly small, though, so perhaps porting that could be relatively straightforward.

I gave it an hour or two a few days ago. I was a bit discouraged that some of the hijacking code was scattered all over the place. I don't expect it to be hard to implement, but it could certainly benefit from some cleanup. Any help is definitely welcome.

I've made a couple of prototypes in the last few days to address some long-standing issues and optimization opportunities. I'm currently gauging interest to see what sticks and what is worth pursuing in the .NET 10 release cycle.
