Understanding the On Stack Replacement (OSR) optimisation in the HotSpot C1 compiler
Here is an explanation by Kris Mok (@rednaxelafx) of why an OSR compilation performed by HotSpot's C1 compiler produces native code for the entire method that contains the loop being OSR-compiled instead of just for the hot loop itself:
Kris:
It's quirky. OSR support in C1 was not as "integrated" as one might think, taking somewhat "hacky" shortcuts to make the implementation easy.
For OSR compilations, it's important to know a few pieces of information going in:
- At which backedge (in terms of the bytecode index, "bci") are we compiling this OSR task for?
- What is the JVM State (aka ValueStack in C1, i.e. Java locals / operand stack / monitors) on that backedge?
- How does the interpreter pass in the actual JVM State at runtime when calling this piece of OSR compiled code? (i.e. the OSR calling convention)
For C1, the answers to the questions above are:
- Passed in from the compiler interface, in terms of "entry_bci" or "osr_bci"
- C1 piggybacks on the normal compilation flow to figure that out (*). This is where things get a bit "hacky".
- C1/C2 and the JVM runtime agree on a calling convention where:
- the interpreter will malloc() an "OSR buffer" and copy the required JVM state into that buffer
- the interpreter will pass a pointer to the OSR buffer as the sole argument to the OSR compiled code
- the OSR compiled code will, at the OSR entry, "migrate" the state from the OSR buffer onto whatever the compiler can manage
- the OSR compiled code should free() the OSR buffer after migration is complete (a toy model of this hand-off is sketched just below)
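To make that hand-off concrete, here is a toy model in plain Java (not HotSpot code; the class and method names are made up for illustration). The real buffer is a C-heap allocation and the "call" is a jump into generated code; an ordinary long[] and a static method stand in for them here:

class OsrHandoffModel {
    // "Interpreter" side: copy the live JVM state (here just three locals)
    // into an OSR buffer and call the OSR entry with that buffer as the
    // sole argument.
    static long handOffToOsrCode(long sum, int i, int n) {
        long[] osrBuffer = { sum, i, n };  // stands in for the malloc()'d OSR buffer
        return osrEntry(osrBuffer);        // the buffer is the sole argument
    }

    // "OSR compiled" side: migrate the state out of the buffer at the entry
    // (the real code would also free() the buffer here), then keep running
    // the loop from where the interpreter left off.
    static long osrEntry(long[] osrBuffer) {
        long sum = osrBuffer[0];
        int i = (int) osrBuffer[1];
        int n = (int) osrBuffer[2];
        for (; i < n; i++) {
            sum += i;
        }
        return sum;
    }

    public static void main(String[] args) {
        // e.g. the interpreter got to i == 5 with sum == 10 before OSR kicked in
        System.out.println(handOffToOsrCode(10L, 5, 10)); // prints 45
    }
}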
So the interesting bit above is (2). For OSR compilations, C1 parses the bytecode and builds the HIR just as it would for "standard compilation" (meaning coming in from the normal entry). After parsing the bytecode once, C1 would have enough information to tell what the JVM State is like at the specified OSR entry BCI.
It then does this hacky thing of "bolting on" an OSR entry block into the control flow graph (CFG) at the specified entry BCI, inserts HIR instructions to load the state (locals / operand stack / monitors) from the OSR buffer, and then also inserts Phi nodes that merge the values flowing down from the standard entry point and the corresponding values just loaded from the OSR buffer. And that's it, the compilation process goes on just as it would for a normal standard compilation. The OSR entry block mostly just sits there throughout the rest of the compilation process, not receiving much of any special treatment until it's time to do block scheduling / register allocation (RA) and final code generation.
The end result of this is: you've got compiled code for the whole method, including the code that starts from the standard entry point, plus an extra OSR entry block containing the code to perform OSR migration and jump to the target of the specified backedge. It's technically possible to tweak the HotSpot runtime and C1 to allow a single C1 OSR compilation to support both the standard entry and an OSR entry (or a set of OSR entries), but currently only a single OSR entry point is used for C1 OSR compiles. (Android Runtime's "Optimizing" compiler is a descendant of C1 and the OSR compilation that it implements would actually emit all backedge targets as valid OSR entry points -- in a single compile. So we know this is doable in practice.)
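Continuing the toy model, here is a hand-written sketch (not real HIR; the names are made up) of what the "whole method plus a bolted-on OSR entry" shape amounts to for a simple counted loop. Both the standard entry and the OSR entry reach the same loop header, which is exactly where the extra Phi nodes are needed:

class OsrEntrySketch {
    // One compiled body, two ways in: osrBuffer == null models the standard
    // entry, a non-null buffer models the bolted-on OSR entry block.
    static long sumLoop(long[] osrBuffer, int n) {
        long sum;
        int i;
        if (osrBuffer == null) {
            // standard entry: normal initialization
            sum = 0;
            i = 0;
        } else {
            // OSR entry block: migrate the state out of the OSR buffer
            sum = osrBuffer[0];
            i = (int) osrBuffer[1];
        }
        // loop header: 'sum' and 'i' are effectively Phi(standard, OSR) here
        for (; i < n; i++) {
            sum += i;
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(sumLoop(null, 10));                  // via the standard entry
        System.out.println(sumLoop(new long[] { 10L, 5L }, 10)); // via the "OSR entry"
    }
}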
This "hacky" way of doing it makes the implementation really simple: it doesn't require a separate pre-pass like C2's ciTypeFlow
(well C1's CFG + IR construction is a two pass thing anyway...) It wouldn't be that hard to make C1 compile only the loop, it's just more work, unless C1 adopts the same ciTypeFlow
like C2, which is going to impact compilation speed.
By the way, the OSR entry block introduces extra Phi nodes for things in the JVM state: locals and the operand stack. That's necessary because the interpreter may have seen state that's different from what C1 would have seen on the path from the standard entry up to the OSR entry.
e.g. there might be two consecutive non-volatile loads from the same field in the original bytecode, and C1 would optimize them into one load, but the interpreter would actually do two loads and may observe different values from them, so it's not valid to simply/blindly optimize that away across the OSR entry:
int a = obj.x;
int b = obj.x;
// OSR entry here
boolean z = a == b;
// Without appropriate Phi nodes, C1 would say this is always true,
// but the interpreter may see otherwise.
One bit of hack here is the handling of monitors. C1 doesn't emit Phi nodes for monitors that are migrated from the OSR buffer. Instead, it just saves the migrated monitors into reserved stack slots, and always performs unlocking by loading from those stack slots. That can be very problematic if, say, we've got a variant of C1 that can potentially perform unlocking directly with a pointer in a register instead of loading from the stack -- then you do need the Phi. I won't go into the gory details here, but that's something I've had to deal with first hand at work... :-(
A note on alternatives:
The story gets even more interesting when you put C2 into the picture.
C2's OSR support went through a couple of iterations. In the JDK 1.4 era there was no ciTypeFlow yet, and C2 would actually do a fake "pre-compile" to get information on the control flow graph and the JVM State at the OSR entry point, throw away that compile, and then do a real compile with that information. The code was really messy back then.
So this "pre-compile" pass was gotten rid of, and the essence of it was distilled and extracted into a "reusable" component called ciTypeFlow
. It's capable of computing enough information on what the JVM State is going to be like at the OSR entry point, and moreover, capable of computing the control flow graph that starts from the OSR entry point.
How complicated is that, you'd wonder.
Consider this nested loop case:
void foo(arg1, arg2, ...) {
    for (...) {
        // do something A
        for (...) { // assuming an OSR entry point at this backedge
            // do something B
        }
        // do something C
    }
}
If we were to only start compiling from the OSR entry point in this example, we'd actually need to generate code that looks like:
- OSR entry / migration
- Run the inner loop with (B) as the loop body until it finishes
- Run (C) once to reach the backedge of the outer loop
- Run the normal outer+inner loop until it finishes
So it's as if we had peeled part of a single iteration of the outer loop out on its own, before we can run the normal version of the outer+inner loop. Of course it gets more complicated as the level of nesting increases -- it's far from trivial.
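As a rough sketch of that shape (hand-written, with made-up counted-loop bounds standing in for the abstract loops above), the "OSR-entry-only" version of foo would have to be laid out roughly like this:

// i and j arrive via OSR migration as the current outer and inner indices;
// n and m stand in for whatever the real loop bounds are.
static void fooFromOsrEntryOnly(int i, int j, int n, int m) {
    // 1. finish the inner loop that was running when OSR happened
    for (; j < m; j++) {
        // do something B
    }
    // 2. run (C) once to reach the backedge of the outer loop
    // do something C
    i++;
    // 3. from here on, run the normal outer+inner loop until it finishes
    for (; i < n; i++) {
        // do something A
        for (j = 0; j < m; j++) {
            // do something B
        }
        // do something C
    }
}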
So the way C1 implemented OSR saves you from having to think about this peeling: the code that starts from the standard entry point is already there, so regardless of loop nesting situations, you'd always have a valid place to jump to.
Extra Readings:
- JavaOne 2001: The Java HotSpot™ Virtual Machine Client Compiler: Technology and Application, Robert Griesemer, Srdjan Mitrovic, Sun Microsystems, Inc.
- JavaOne 2002: Client Compiler for the Java HotSpot™ Virtual Machine: Technology and Application, Tom Rodriguez, Ken Russell, Sun Microsystems, Inc.
Vladimir Ivanov (@iwan0www):
Really nice writeup & exceptional history digging, Kris! :-)
(Disclaimer: I can only guess about motivation behind design decisions in C1 by inspecting the code and reading accompanying materials. Also, I'm far behind Kris in history digging skills :-))
Quick note: most likely, the OSR implementation was simpler initially and some complications were added incrementally over its lifetime. For example, OSR support in C1 was there from the very beginning (1.3), but C1 HIR was turned into SSA form only in JDK 6. (So, originally there was no need to worry about Phis.)
While there's a comprehensive answer above as to why C1 starts compiling from the very beginning of the method, it doesn't cover why the rest of the method (after the OSRed loop) is compiled as well.
The key aspect here is that C1 doesn't rely on profiling information during compilation.
Compared to C2, C1 kicks in early during startup/warmup (due to low thresholds in its compilation heuristics), which makes profile info much less useful (fewer samples & the application is not in a steady state yet). Another consequence is that C1 sees much more not-yet-executed code during compilation. To avoid recompilations due to unresolved methods/fields or not-yet-loaded classes, it extensively uses code patching.
So, compiling not-yet-executed code is a viable decision: it saves resources on recompilations and still produces code of decent quality (modulo missed inlining opportunities through calls to not-yet-resolved methods).
(I deliberately differentiate not-yet-executed code (based on profile info) from dead code (provably unreachable).)
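To make that difference concrete, here is a hedged sketch (hypothetical class and method names, not taken from the text above) of reachable but not-yet-executed code. If work() only ever takes the hot path before compilation, the reference to Rare.log is still unresolved and Rare may not even be loaded; C1 compiles the branch anyway and patches the constant-pool references when they first resolve, whereas C2 would typically prune such a branch behind an uncommon trap:

class PatchingExample {
    static class Rare {
        static void log(String msg) { System.out.println(msg); }
    }

    static void work(boolean rare) {
        if (rare) {
            // never executed during warmup: the call site (and the class
            // behind it) is still unresolved when C1 compiles work()
            Rare.log("slow path");
        }
        // hot path continues here
    }

    public static void main(String[] args) {
        for (int i = 0; i < 1_000_000; i++) {
            work(false); // only the hot path runs, yet C1 still compiles the rare branch
        }
    }
}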
For example, if tiered compilation is off and only C1 is used, there's only one OSR version produced:
$ java -XX:TieredStopAtLevel=1 -Xbatch -XX:+PrintCompilation OSRTest
...
529 97 % b 1 OSRTest::main @ 13 (107 bytes)
534 98 b 1 OSRTest::main (107 bytes)
On the contrary, C2 aggressively prunes not-yet-executed code and unresolved calls, so it has to go through a series of OSRs as it makes its way through the code for the first time:
$ java -XX:-TieredCompilation -Xbatch -XX:+PrintCompilation OSRTest
...
404 8 % b OSRTest::main @ 13 (107 bytes)
427 8 % OSRTest::main @ 13 (107 bytes) made not entrant
428 9 % b OSRTest::main @ 37 (107 bytes)
452 9 % OSRTest::main @ 37 (107 bytes) made not entrant
453 10 % b OSRTest::main @ 61 (107 bytes)
471 10 % OSRTest::main @ 61 (107 bytes) made not entrant
471 11 % b OSRTest::main @ 85 (107 bytes)
492 11 % OSRTest::main @ 85 (107 bytes) made not entrant
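OSRTest itself isn't reproduced here, but any main method with several consecutive hot loops shows this pattern; a hypothetical reconstruction consistent with the four OSR bcis in the output above could look like:

public class OSRTest {
    public static void main(String[] args) {
        long sum = 0;
        // four consecutive hot loops: C1 produces one OSR version (plus the
        // normal compile), while C2 goes through a fresh OSR compile as
        // execution reaches each loop it had not yet seen run
        for (int i = 0; i < 1_000_000; i++) sum += i;
        for (int i = 0; i < 1_000_000; i++) sum -= i;
        for (int i = 0; i < 1_000_000; i++) sum += i;
        for (int i = 0; i < 1_000_000; i++) sum -= i;
        System.out.println(sum);
    }
}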
And that's only comparing C1 vs C2. In tiered mode, they serve different purposes, but I won't dive into that here.
Hope it helps.