CRIU: Add support for dynamic debug interpreter transition on restore #17642
Based on a discussion with @vijaysun-omr, we came up with a few possible ways forward.

**1. Disable all compiled code**

This is relatively straightforward to do; in fact this is what we currently do for

**2. Fail the checkpoint and start a new JVM in default mode**

This is probably less of an option for the JVM and more for applications; an application can be configured to handle the failure and instead start a new JVM in default mode. This would not maintain Dev/Prod Parity, but it is a fallback option that would at the very least guarantee functionality from a Java application user's point of view.

**3. Generate code pre-checkpoint as if the JVM is running in FSD mode**

Generating code as if the JVM is in FSD mode means running in Involuntary OSR Mode. This means any yield point can be a place where the VM triggers the transition of a thread from JIT'd code to the interpreter. The downside of this approach is that FSD-compliant JIT code is around 30% slower. However, this may not matter too much for first response; for steady-state throughput, these FSD bodies can be generated with GCR trees to force recompilation post-restore.

An important subtlety here is that if debug is not enabled post-restore but redefinition is still possible, the code cache will have some method bodies that support involuntary OSR (i.e., those that were generated pre-checkpoint) and the rest that support voluntary OSR. As such, the VM will need to check a (yet to exist) flag in the body's metadata to determine what type of OSR was used. When redefinition needs to occur, the VM will need to check, at a yield point, if the body was compiled to support involuntary OSR, and if so, decompile it regardless of the type of yield point; otherwise, normal Voluntary OSR mechanics apply.

**4. Choose async checkpoints that, in Voluntary OSR Mode, allow redefinition to occur, and add OSR transitions there**

If option 3 is too expensive, another approach is to run in a suboptimal Voluntary OSR Mode.
Rather than run the Fear Analysis to minimize the OSR transition points, we force the transition points to be the exact set of yield points that are used to ensure that redefinition occurs; while this set is larger than what would result from an optimal OSR analysis, it is still likely smaller than the set of points in option 3. However, an important caveat here is that any yield point that is not used to ensure that redefinition occurs must be ignored by the VM for the purpose of checkpointing; the thread should be allowed to continue execution until it hits one of these yield points that is also a transition point (it is guaranteed that the thread will not execute indefinitely before reaching such a point). Another caveat is that we will need to add Voluntary OSR support for AOT (#4849).
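The per-body flag check described in option 3 could look something like this minimal Python sketch. All names here (`CompiledBody`, `osr_kind`, `must_force_decompile`) are invented for illustration and are not OpenJ9 structures:

```python
class CompiledBody:
    """Per-body metadata carrying the (hypothetical) OSR-kind flag."""
    def __init__(self, name, osr_kind):
        self.name = name
        # "involuntary" for pre-checkpoint FSD bodies, "voluntary" otherwise
        self.osr_kind = osr_kind

def must_force_decompile(body):
    # Pre-checkpoint FSD bodies have no OSR guards, so when redefinition
    # needs to occur the VM must decompile them at any yield point;
    # voluntary-OSR bodies follow the normal guard-based mechanics.
    return body.osr_kind == "involuntary"
```

The point of the sketch is only that the decision is per method body, driven by a flag recorded at compile time, rather than a single global mode.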
I am going to start investigating the perf impact of option 3 first. Specifically, I will generate two builds where:
FYI @gacholio

I'm not too familiar with this detail; how do we differentiate this in the VM? There are two main mechanisms we use: exclusive and safepoint exclusive. @gacholio, thoughts?

My impression from discussion with Tobi is that we would just discard all the compiled code if debug was enabled on restore. This avoids any number of difficult issues. The checkpoint code uses safepoint exclusive, so all threads will certainly be at an OSR point.
@gacholio that is captured in Irwin's case 3 and 4. From my understanding, what Irwin is saying is that the JIT either needs to be in FSD mode (non default) or Voluntary OSR (default) mode for us to decompile the JIT frames on the stack.
To me this sounds like we could then use case 4, which is the cheaper option.
The OSR I'm talking about is, I believe, involuntary, in that we force it on all threads (it's not induced by a failed check in the compiled code). Does involuntary require FSD? I didn't think so.
FSD involves involuntary OSR; normal HCR enabled mode uses voluntary OSR.
So either we need to start in involuntary mode always (or at least if we want to support the possibility of debug) or add guards at every OSR point to check for the switch (maybe this can be done via the assumptions mechanism?).
Well, once the guards are patched it will always transition to the VM. As such, once we enter debug mode, the entire code cache might as well be discarded (same with the AVL trees). However, if we don't enter debug mode, the code quality should be better than with Involuntary OSR mode. Also, with this approach, at the time when the VM wants to stop threads to prepare for checkpoint, if the thread hits some other yield point that isn't an OSR transition point, it needs to be allowed to return back to running JIT'd code; it's only in Involuntary OSR mode that all yield points are OSR transition points. That's why if we can get away with Involuntary OSR mode pre-checkpoint, that would be the simplest approach to take.
This is the part that is challenging. I'm not sure how we detect this.
I believe we will be reinitializing the send targets for all methods when we restore, which has the effect of abandoning all of the compiled code (by which I mean the interpreter will never invoke it again), so normal CCR should be able to discard the old method bodies once every running invocation has OSRed back to the interpreter.
Let's not do this - it's essentially another layer of exclusive on top of safepoint, which would be completely unmanageable (I'd already like to see some proof that safepoint is valuable given how many problems it has had).
@vijaysun-omr could elaborate more on this perhaps, but he did mention that there are only very specific bytecodes that matter for the purpose of (in a normal run) ensuring that we yield to allow a redefinition event (for example, if we're in a loop with no monitor enters/invocations, we need to ensure that we don't loop indefinitely). If there's some way to identify at the yield point / transition point what the bytecode is supposed to be, we would be able to distinguish between normal yield points and OSR transition points. Of course the critical point here is that the set of OSR transition points must be the set of yield points that are necessary to ensure a redefinition event. It may also be that when we transition via OSR, we end up in a different place than when we yield via a yield point, so that too could be a distinguishing factor. That said, I don't know if what I just described is absolutely accurate, so I'll let Vijay clarify.
I am under the impression that under our present default HCR implementation, the VM only allows actual class redefinition to occur at certain yield points, and my understanding is that those yield points are 1) async checks, 2) method calls (probably via stack overflow check), and 3) monitor enter. If this is not how the VM is doing class redefinition, then please clarify. If this is how the VM is doing class redefinition, then I don't understand what more is needed in order to support option 4 in Irwin's post.
Redefinition can occur at any place that releases VM access. These would include:
With some exceptions, if you call out from compiled code, that's a redef point (some JIT helpers will never release VM access, so we'll need to be very careful in future if we change a helper and the JIT has assumed it will not release VM access). The only practical solution for compiled code is to discard it entirely on restore (i.e. post decompiles for every compiled frame in every thread). This will naturally result in the debug interpreter being invoked after the decompiles.

Safepoint HCR means that object allocation is not an OSR/decompile point (the checkpoint code gets that kind of access if necessary). The requirement is that we have an OSR block at all of the possible locations that a method could be paused (by safepoint exclusive). I'd rather not rely on guards to accomplish this since it would be very hard to distinguish which points will rely on the guard fail and which need to be forced into OSR.

When we restore, we will mark all frames in all stacks for decompile, and reset all method send targets back to their default (count and compile in the JIT case). Eventually, the obsolete compiled code will be unreferenced and able to be reclaimed.
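The restore path described above can be simulated in a few lines of Python. This is only a sketch of the two steps (mark every compiled frame for decompile, reset every method's send target to its default) with all names invented for illustration, not actual OpenJ9 code:

```python
DEFAULT_SEND_TARGET = "count_and_compile"  # hypothetical default target name

class Method:
    def __init__(self, name):
        self.name = name
        self.send_target = "jitted_body"   # currently dispatches into compiled code

class Frame:
    def __init__(self, method, is_compiled):
        self.method = method
        self.is_compiled = is_compiled
        self.decompile_pending = False

def on_restore(thread_stacks, methods):
    # Mark every compiled frame in every thread's stack for decompile ...
    for stack in thread_stacks:
        for frame in stack:
            if frame.is_compiled:
                frame.decompile_pending = True
    # ... and reset each method's send target to its default, so new
    # invocations go back through the interpreter (count and compile).
    for method in methods:
        method.send_target = DEFAULT_SEND_TARGET
```

Once every pending decompile has run and no send target references the old bodies, the obsolete compiled code is unreferenced and can be reclaimed, as described above.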
That list of program points in compiled code from @gacholio where class redefinition may occur (ignoring FSD for the moment) is what we used to have, until some more OSR changes were made to the design a few years ago, as I understand it. The basis of this understanding is this code: the code under the if-condition I pasted only checks for calls, async checks and monitor enters as spots where it needs to arrange for OSR transitions ("post execution OSR" there means it will set up the OSR transition after those operations are done and we return back to the JITed code): https://github.com/eclipse/omr/blob/2d5ac63fbe881f0af035ef2732b22f85eb3893dd/compiler/compile/OMRCompilation.cpp#L637

There is also this comment that alludes to what that code does:

There must have been some VM code added to ensure we only redefine at those 3 points, since the JIT is not in charge of where class redefinition occurs. The point of debate is this category, which the above JIT code does not seem to consider anymore as a place where redefinition is possible:
You are likely referring to safepoint OSR, which only eliminates object allocation from the list of HCR points: openj9/runtime/compiler/control/J9Options.cpp, lines 3088 to 3093 at commit b034032.
Looking at the code, in HCR (not FSD) mode, the VM does not force decompile anywhere - it calls

So, I suppose it's up to the JIT to determine where HCR checks need to be inserted to ensure correctness. One thing I think we've all forgotten (and I've just remembered) is that HCR does not affect existing frames on the stack. The requirement is that all new method invocations target the most current version of the method. This may mean that existing HCR/OSR is not sufficient to accomplish what's needed here, as we will be unable to simply discard the code cache like we do for FSD (extended) HCR.
There are two different concepts at play here:
In the case of FSD, i.e. when we use Involuntary OSR, the sets of these points end up being the same from the point of view of the JIT, because all those yield points mentioned by Gac are decompilation points. In the case of default HCR, i.e. when we use Voluntary OSR, from the point of view of the JIT, redefinition and decompilation points are not necessarily the same. In general, a thread yields to the VM to allow a STW redefinition event to occur, and then the thread continues executing until it reaches a decompilation point. The only yield points that could be redefinition points are, as Vijay mentioned, the selected

What Option 4 in #17642 (comment) proposes is to essentially make the set of redefinition points (from the JIT's pov in Voluntary OSR mode) also the set of decompilation points. This can be implemented in two ways:

1 is obviously the cleaner approach, but 2 may be more practical in terms of being able to reuse non-FSD infrastructure. At any rate, the question of what the redefinition points are and what the decompilation points are is an orthogonal concern to Option 4 above, which banks on the fact that we must already be able to distinguish between the two for HCR. All that said, if the assumption that redefinition cannot occur outside of
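The two sets of points being discussed can be sketched as predicates in Python. The yield-point kinds follow the three categories mentioned earlier in the thread (async checks, calls, monitor enters); the mode strings and the `option4` switch are invented for illustration:

```python
# Yield points that can double as redefinition points in Voluntary OSR
# mode, per the discussion above (hypothetical classification).
REDEF_YIELD_POINTS = {"async_check", "call", "monitor_enter"}

def is_redefinition_point(osr_mode, yield_kind):
    if osr_mode == "involuntary":   # FSD: every yield point qualifies
        return True
    return yield_kind in REDEF_YIELD_POINTS

def is_decompilation_point(osr_mode, yield_kind, option4=False):
    if osr_mode == "involuntary":   # FSD: the two sets coincide already
        return True
    if option4:                     # Option 4: force the sets to coincide
        return yield_kind in REDEF_YIELD_POINTS
    return False                    # default Voluntary OSR: decompile at OSR guards, not here
```

In default Voluntary OSR mode the predicates differ (a thread can yield for redefinition and keep running until an OSR guard fires); Option 4 collapses the two sets onto the same three yield-point kinds.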
Classically, HCR could occur any time VM access can be released. That includes all of the places (and possibly more) that I detailed above. The only HCR change I can think of is the safepoint OSR (which I think you refer to as

When the HCR occurs, the VM does not add any decompilations - it reports the modified classes/methods so the JIT can do the appropriate patching (presumably invalidating calls to any potentially-replaced methods). As stated above, there's no need to decompile when the thread resumes - it's fine to wait until a new method invocation is going to take place (even then, if you know that the invoked method has not been replaced, you can just go ahead and invoke it).
It's tempting to use voluntary OSR to let the decompiles trickle in as the compiled code detects the restore, but this won't work properly in the debugger (an obvious example is that the debugger would not be able to query locals in frames that remain compiled without FSD). I think the only way this will work is to make every escape point (except allocation points in next gen) from the compiled code into an OSR point, and do the force decompile (involuntary) on restore.
@gacholio that sounds a lot like Graeme's

How valid this is depends on the user requirements, but it seems like a reasonable position to me.
I don't see the correlation, and I would have to say no to building on top of 20-year-old abandoned tech (I doubt there's even a mention of it left in the codebase). It also does not address my above concern about locals.
We don't want to reuse the
After talking to @jdmpapin and Vijay, I believe that the three types of yield points I mentioned above do cover most of what is handled by safepoints. However, it may be that the resolve helpers are not handled; we'll have to take a look and see if we do handle it in some other way; either way we would have to make them an explicit OSR point.
Yeah, that sounds right. Actually, additionally we need to make these points also the only points at which a thread can yield to allow a checkpoint. Essentially, in Option 4, we need the set of Redefinition Points (Escape Points/HCR Points), the set of OSR Points (Involuntary OSR Transition Points), and the set of Checkpoint Points to be the same set of points. Overall though, I do agree that if FSD-compliant code pre-checkpoint is sufficient then we should just stick to that.
I launched some perf runs to measure the impact of generating FSD-compliant code. I ran the

I had 3 builds:
| Build | Startup Slowdown | First Response Slowdown |
|---|---|---|
| FSD Always | 5% | 4% |
| FSD Pre-checkpoint | 4% | 3% |

restcrud

| Build | Startup Slowdown | First Response Slowdown |
|---|---|---|
| FSD Always | 2.5% | 15% |
| FSD Pre-checkpoint | 2.5% | 2% |
From the looks of things, the FSD approach (i.e. Option 3) looks to be sufficient to enable debug post-restore.
That said, there are some things that we need to address.
- If debug is not enabled post-restore, then in order to support "normal" HCR redefinition, at any yield point (not just the safepoints mentioned above), the VM will need to check a (yet to be defined) flag in the method's metadata to see if it is a FSD body; if it is, the VM will have to trigger an OSR transition. This is because FSD bodies do not have any OSR guards. Essentially, we have never been in a situation where we have Involuntary and Voluntary OSR method bodies at the same time.
- I have to ensure that throughput is not affected. I did some initial throughput runs, and it turns out that the FSD Pre-checkpoint build is not much better than the FSD Always build, which is ~40% worse than baseline. Part of this comes from the fact that the FSD compliant bodies do not get recompiled (because up until this point, we never had the need for FSD bodies to exist except if we knew for certain debug was enabled). However, this doesn't explain the entire gap, and so I need to investigate further; it may be that there are other optimizations that get disabled under FSD that I missed in the post-restore options processing where I reset the FSD flag.
Add a new private flag which instructs the interpreter to exit and re-invoke itself. This will be used by CRIU when a restored image requests debug capabilities (by changing the interpreter entry point to the debug interpreter). Related: eclipse-openj9#17642 Signed-off-by: Graham Chapman <[email protected]>
Support transition to debug interpreter on restore when the transition is requested with an env var file or an options file. Issue: eclipse-openj9#17642 Co-authored-by: Tobi Ajila <[email protected]> Signed-off-by: Amarpreet Singh <[email protected]>
How can I know the status of the issue?
The compiler work is being tracked here: #18866 (it does include some VM pre-requisites). For the most part, the compiler functional work is done, but we still need to reduce the footprint gap caused by generating FSD pre-checkpoint.
How can I know if now is a good time to address the VM pre-requisites, given that it seems we still need to reduce the footprint gap caused by generating FSD pre-checkpoint?
The footprint gap and the VM pre-requisites are independent; the work to reduce the footprint gap is not going to be impacted by the necessary VM changes. That said, you should probably coordinate with @JasonFengJ9 since I believe he's working on the VM-side debug on restore work.
@JasonFengJ9 how can I potentially assist with the debug on restore work?
The first openj9 portion of the debug on restore work was

The corresponding extension repo PR (initially opened by Mike Z., now with my changes) is awaiting review. I have a draft PR for the second openj9 PR which is being tuned according to Irwin's perf results; the ETA is next week or so. There are quite a few other CRIU open issues; please talk to @tajila for a suitable task.
It seems like a suitable task was discussed after I talked to @tajila.
How can I contribute to the task?
Background
We currently have 3 interpreters: the normal one, CRIU, and debug. Ideally, we would like to get to a position where we only have two interpreters: normal and debug. The CRIU interpreter was added because the normal interpreter was missing capabilities (method enter/exit checks) needed to support serviceability features like Java method tracing dynamically upon restore.
Goal
Detect the request to run with the debug interpreter, then exit the normal interpreter and continue in the debug interpreter. If we can achieve this then we gain
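The "exit and re-invoke" mechanic described in the commits above can be illustrated with a small Python simulation. Everything here is invented for illustration (this is not the actual OpenJ9 interpreter): the normal interpreter loop polls a private flag, and when a restore requests debug capabilities it unwinds so the dispatcher can re-enter through the swapped debug entry point:

```python
def debug_interpreter(state):
    # Debug interpreter: executes remaining bytecodes with debug capabilities.
    while state["bytecodes"]:
        state["trace"].append(("debug", state["bytecodes"].pop(0)))

def normal_interpreter(state):
    while state["bytecodes"]:
        if state["exit_and_reinvoke"]:
            return "REINVOKE"          # unwind so the caller can re-dispatch
        bc = state["bytecodes"].pop(0)
        if bc == "criu_restore_with_debug":
            # Restore requested debug: swap the entry point and set the flag.
            state["entry_point"] = debug_interpreter
            state["exit_and_reinvoke"] = True
        else:
            state["trace"].append(("normal", bc))

def run(state):
    entry = normal_interpreter
    while entry(state) == "REINVOKE":
        state["exit_and_reinvoke"] = False
        entry = state["entry_point"]   # continue in the debug interpreter
```

The key property being modeled is that no bytecode is lost across the switch: execution leaves the normal loop at a well-defined point and the debug interpreter picks up exactly where it stopped.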
Challenges
Places to detect change: