Compiler Support for Debug On Restore #18866

dsouzai · 2024-01-31T21:19:47Z

This issue specifically tracks the Compiler support for Debug on Restore.

Parent issue: #17642

Approach

See #17642 for the discussion.

Generate FSD Code pre-checkpoint
Post-restore, compile non-FSD code and recompile all FSD bodies compiled pre-checkpoint

Implementation

In order to balance the constraints of startup/first response, footprint, and throughput, there are several additional pieces that need to be put in place.

AOT validation must be delayed until post-restore. Otherwise, the validation framework will see that the compiler is running in FSD mode but the (already populated) SCC does not hold any FSD compatible code, leading to no AOT code getting loaded even post-restore.
The IProfiler needs to be enabled during startup even if an SCC exists; this is because the SCC is not going to be used and therefore the heuristics of depending on the persistent profiling info available in the SCC is not applicable.
Higher counts need to be used so that the IProfiler can collect sufficient data to preserve throughput. However, some of methods that would have been compiled at a low count are important for startup/first response. Thus, these important methods need to be proactively compiled in the checkpoint hook.
Failed compilations pre-checkpoint should be retried post-restore; this is because the failure may be entirely due to the constraint of having to generate FSD code.
Some IProfiler data pre-checkpoint may need to be dropped to reduce footprint; there may be more IProfiler data collected pre-checkpoint than is useful.

VM Coordination

Given the above compiler implementation details, there are three main pieces of VM coordination that has be implemented in order to ensure functional correctness

If the proactive compilations in the checkpoint hook (point 3 above) are done without FSD (which is probably a good idea to reduce the amount of recompilation that will need to occur post restore), then if debug on restore is enabled post-retore, the VM will have to force a transition to the interpreter in the restore hook even if the debug events aren't enabled yet. Otherwise, application threads will start running non-FSD code which will not support a decompile when the events do get enabled. EDIT: This may not be needed (see Compiler Support for Debug On Restore #18866 (comment)).
If debug is not required, but normal redefinition is allowed (Fast HCR), then the VM needs to ensure that FSD bodies are decompiled when a redefinition event occurs, and not just calling jitConfig->jitClassesRedefined. This is because the FSD bodies do not have OSR guards (i.e. Voluntary OSR); they depend on the VM triggering the transition (i.e. Involuntary OSR). GAC is aware of this consideration in CRIU: Add support for dynamic debug interpreter transition on restore #17642.
Events J9HookDisable needs to be reset post-restore. The way the JIT checks if an event is enabled is by invoking J9HookDisable; however, this also has the side-effect of disabling the event if it has not already been enabled. This is to ensure that an agent later on cannot enable events that JIT generated code cannot support. Post-restore, the JIT does a similar set of calls to check if FSD is enabled, and so these events should be reset. Add ability to re-enable disabled hooks post-restore #16959 tracks this requirement.

PRs

VM

When a Fast HCR event occurs, ensure FSD bodies transition to the interpreter (Point 2 in VM Coordination Section)
CRIU supports Java debugger via the restore option file #19754 (Point 3 in VM Coordination Section)

Regarding Point 3 in the VM Coordination section, the current plan is a two pronged approach: 1. the VM will reset the necessary bits to allow J9HookDisable to function correctly post-restore, and 2. the JIT will not invoke the J9HookDisable calls (at least for FSD) until post-restore under -XX:+DebugOnRestore.

The text was updated successfully, but these errors were encountered:

dsouzai · 2024-01-31T21:20:02Z

fyi @tajila @vijaysun-omr

vijaysun-omr · 2024-02-01T20:06:41Z

"The IProfiler needs to be enabled during startup even if an SCC exists..." ... I assume you mean before checkpoint ? i.e. I assume we can use the SCC after restore and therefore won't want to pay the cost of IProfiler in the post-restore period until first response (or whatever stage in startup process iprofiling is off for) is done.

dsouzai · 2024-02-01T20:59:14Z

Yeah that's right, that's why part of the heuristic work is to "Turn off IProfiler post-restore so it can be naturally enabled later".

dsouzai · 2024-03-01T23:13:39Z

If the proactive compilations in the checkpoint hook (point 3 above) are done without FSD (which is probably a good idea to reduce the amount of recompilation that will need to occur post restore), then if debug on restore is enabled post-retore, the VM will have to force a transition to the interpreter in the restore hook even if the debug events aren't enabled yet. Otherwise, application threads will start running non-FSD code which will not support a decompile when the events do get enabled

I think we may be able to avoid having to worry about this. We can still do all the proactive compilation, but we don't have to update the extra field of the J9Method; instead we can just keep a list of these methods along with the new body. On restore, if Debug is not specified, we can update the extra field then; if it is, then we ignore the list. We could even queue them for code cache reclamation.

dsouzai · 2024-03-15T16:30:13Z

The previous comment has the caveat that the cost of delaying the updating of the extra field should not impact startup/first response; if doing so is too expensive then we will have to default to the original design.

dsouzai · 2024-03-19T15:31:50Z

After speaking with @mpirvu and @vijaysun-omr there is another option that really only became apparent after all of the work we did to get the compiler support to the point it is now.

The main idea is to run interpreted pre-checkpoint, but use all the heuristics described in this issue to ensure that the environment post-restore in terms of JIT'd code, IProfiler, SCC Validation, etc. are all the same. Furthermore, depending on the evaluation of #18866 (comment), it may even be possible to delay updating the extra field until post-restore.

The benefits of doing so are

Low Footprint (since there won't be a need for any FSD code generation)
Simpler VM coordination post-restore (since there won't be a mix of Voluntary and Involuntary code)
No need for the VM to decompile in the restore hook

The drawbacks are

Longer runtime to checkpoint
Potential impact to startup/first response because of more interpreted methods on application thread stacks

Because of all the infrastructure that I already have in place in my branches, I don't think it should be all that much work to evaluate this idea, so I'll update once I have more information.

dsouzai · 2024-03-19T17:44:36Z

fyi @gacholio

dsouzai · 2024-03-21T15:48:45Z

Some observations from my first set of experiments, with the latest changes I have based on open PRs:

The current build involves generating FSD code pre-checkpoint, doing a set of (non-FSD) proactive compilation in the checkpoint hook, and then triggering recompilation of FSD methods post-restore. This seems to have minimal impact on startup and firstreponse (~1-2%) but a very large increase in footprint. Part of this is due to the increase in code cache and data cache, but also an increase in the SCC RSS.

I need to do a bit more investigation on this, but it's very likely due to the recompilations post-restore. Essentially, the process of triggering a recompilation will result in either the requesting thread or the comp thread touching the J9ROMClass at some point, which will bring the page back into the RSS. The reason I think this is likely is because when I moved the recompilations to occur pre-checkpoint, the RSS of the SCC went down. However, I haven't validated if this is consistent, so I don't want to confirm this just yet. Additionally, the code/data cache and internal memory is higher because now we have both FSD bodies and the recompiled non-FSD bodies.

I also tried running interpreted pre-checkpoint, but performing (non-FSD) compilations in the background, and then updating the relevant j9method->extra fields in the checkpoint hook. This does impact startup and footprint a bit more (4-5%), but the footprint is much lower than the previous builds; it is still higher than baseline, but a cursory investigation showed that the majority of it comes from the code/data cache and jit persistent memory. This is because we actually compile more methods than baseline; this is likely due to the fact that in baseline, just because many methods have their count set to 20 does not mean they actually decrement that count because they may have been inlined into other methods. In my prototype build however, in the pre-checkpoint hook, I compile everything that had their count set to 20; this is likely unnecessary.

Finally, it became apparent that delaying updating the extra field until restore is not feasible. jitMethodTranslated touches the J9ROMClass of the method, and so we will bring all of those pages back into the RSS¹.

Given the following two observations:

Triggering recompilation post-restore is very likely the reason for the increase in footprint, so the recomps need to occur pre-checkpoint
The VM will always need to decompile on restore if Debug is enabled on restore because delaying updating the startPC until restore increases footprint

Generating FSD code pre-checkpoint only to then also generate the recompiled non-FSD version pre-checkpoint seems wasteful; it increases the code/data cache and jit persistent memory usage. However, the benefit is that the pre-checkpoint run-time of the application is minimally impacted.

Only running interpreted pre-checkpoint prevents the complication of having both FSD and non-FSD code, and facilitates the means by which footprint should match baseline. However, it has the issue that the pre-checkpoint run-time is significantly impacted.

I should also note that I have yet to evaluate the throughput implications of all of this.

Theoretically jitMethodTranslated could be modified to take in two extra parameters that answer the questions J9ROMMETHOD_HAS_VTABLE(J9_ROM_METHOD_FROM_RAM_METHOD(method)) and J9ROMCLASS_IS_INTERFACE(currentClass->romClass) to prevent the need to actually touch those pages. ↩

vijaysun-omr · 2024-03-21T19:17:58Z

Thanks for the summary. A few factors that you bring up, but as a matter of opinion, I don't feel are very significant, i.e. we could choose to lower their importance to get to the final design point.

The complexity of dealing with FSD and non-FSD code pre-checkpoint. In a sense, you have already dealt with this complexity, and so I don't know if this is a major issue from a development viewpoint (it could be from a maintenance viewpoint).
The memory wastefulness of generating all the FSD and non-FSD code pre-checkpoint, since (correct me if I am wrong), you could essentially reclaim (in the code cache reclamation sense) all the FSD bodies that are not on some executing thread stack, right ? This, coupled with a plan to disclaim (from RSS) all the FSD code generated pre-checkpoint ought to reduce most of the waste associated with this scheme ?
The startup time performance of the case when debug is enabled on the restore side, i.e. the cost of always decompiling in that case

I guess I am arguing in favor of having FSD and non-FSD bodies before checkpoint, rather than run interpreted until then.

gacholio · 2024-03-21T19:43:45Z

All FSD bodies can be reclaimed as we will be decompiling any that are on stack at restore. You could even do it aggressively on restore after the decompilations have been added to the appropriate stack frames.

dsouzai added comp:vm comp:jit criu Used to track CRIU snapshot related work labels Jan 31, 2024

dsouzai mentioned this issue Jan 31, 2024

Compiler Support for CRIU #16853

Open

30 tasks

dsouzai mentioned this issue Feb 6, 2024

Generate FSD code pre-checkpoint under -XX:+DebugOnRestore #18894

Merged

dsouzai mentioned this issue Feb 20, 2024

Recompile FSD Bodies generated under -XX:+DebugOnRestore #18982

Merged

dsouzai mentioned this issue Mar 15, 2024

Refactor Checkpoint/Restore Code into its own class #19157

Merged

This was referenced Apr 12, 2024

Delay AOT validation until restore under -XX:+DebugOnRestore #19316

Merged

Heuristics to minimize performance impact of -XX:+DebugOnRestore #19342

Merged

dsouzai mentioned this issue Jun 6, 2024

CRIU: Add support for dynamic debug interpreter transition on restore #17642

Open

dsouzai mentioned this issue Jun 28, 2024

Generic Double Memory Map API eclipse/omr#7396

Open

dsouzai mentioned this issue Aug 23, 2024

CRIU supports Java debugger via the restore option file #19754

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Compiler Support for Debug On Restore #18866

Compiler Support for Debug On Restore #18866

dsouzai commented Jan 31, 2024 •

edited

Loading

dsouzai commented Jan 31, 2024

vijaysun-omr commented Feb 1, 2024 •

edited

Loading

dsouzai commented Feb 1, 2024

dsouzai commented Mar 1, 2024

dsouzai commented Mar 15, 2024

dsouzai commented Mar 19, 2024

dsouzai commented Mar 19, 2024

dsouzai commented Mar 21, 2024 •

edited

Loading

vijaysun-omr commented Mar 21, 2024 •

edited

Loading

gacholio commented Mar 21, 2024

Compiler Support for Debug On Restore #18866

Compiler Support for Debug On Restore #18866

Comments

dsouzai commented Jan 31, 2024 • edited Loading

Approach

Implementation

VM Coordination

PRs

VM

dsouzai commented Jan 31, 2024

vijaysun-omr commented Feb 1, 2024 • edited Loading

dsouzai commented Feb 1, 2024

dsouzai commented Mar 1, 2024

dsouzai commented Mar 15, 2024

dsouzai commented Mar 19, 2024

dsouzai commented Mar 19, 2024

dsouzai commented Mar 21, 2024 • edited Loading

Footnotes

vijaysun-omr commented Mar 21, 2024 • edited Loading

gacholio commented Mar 21, 2024

dsouzai commented Jan 31, 2024 •

edited

Loading

vijaysun-omr commented Feb 1, 2024 •

edited

Loading

dsouzai commented Mar 21, 2024 •

edited

Loading

vijaysun-omr commented Mar 21, 2024 •

edited

Loading