Recompile FSD Bodies generated under -XX:+DebugOnRestore #18982

Merged
merged 3 commits into from
Apr 12, 2024

Conversation

dsouzai
Contributor

@dsouzai dsouzai commented Feb 20, 2024

  • Recompile FSD Bodies generated pre-checkpoint under -XX:+DebugOnRestore
  • Disable sample based recompilation pre-checkpoint under -XX:+DebugOnRestore

Parent issue: #18866

@dsouzai dsouzai added comp:jit criu Used to track CRIU snapshot related work labels Feb 20, 2024
@dsouzai
Contributor Author

dsouzai commented Feb 20, 2024

@mpirvu could you please review?

Contributor

@mpirvu mpirvu left a comment

When maintaining queues of methods to be compiled we have to observe any class unloading and redefinition events and purge entries from the queue as needed.
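A minimal sketch of the purging this comment asks for, assuming a simple queue of method pointers. `ClassInfo`, `MethodInfo`, `PendingCompilations`, and `purgeClass` are hypothetical names for illustration, not OpenJ9 types or functions.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <deque>
#include <string>

// Hypothetical stand-ins for the VM's class and method structures.
struct ClassInfo { std::string name; };
struct MethodInfo { std::string name; const ClassInfo *owner; };

// A pending-compilation queue that must never keep methods of a class
// that has been unloaded or redefined.
class PendingCompilations {
public:
    void add(const MethodInfo *m) {
        assert(m != nullptr);
        _queue.push_back(m);
    }

    // Intended to be called from a (hypothetical) class-unload or
    // redefinition hook: drops every queued method owned by `clazz`
    // and returns how many entries were removed.
    std::size_t purgeClass(const ClassInfo *clazz) {
        const std::size_t oldSize = _queue.size();
        _queue.erase(std::remove_if(_queue.begin(), _queue.end(),
                                    [clazz](const MethodInfo *m) { return m->owner == clazz; }),
                     _queue.end());
        return oldSize - _queue.size();
    }

    std::size_t size() const { return _queue.size(); }

private:
    std::deque<const MethodInfo *> _queue;
};
```

If the purge is skipped, a later compilation request could dereference a method whose class no longer exists, which is why the hook has to run before the class memory is reclaimed.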

runtime/compiler/control/CompilationThread.cpp Outdated Show resolved Hide resolved
runtime/compiler/control/CompilationThread.cpp Outdated Show resolved Hide resolved
runtime/compiler/control/CompilationThread.cpp Outdated Show resolved Hide resolved
runtime/compiler/control/CompilationThread.cpp Outdated Show resolved Hide resolved
runtime/compiler/control/CompilationThread.cpp Outdated Show resolved Hide resolved
runtime/compiler/control/HookedByTheJit.cpp Outdated Show resolved Hide resolved
runtime/compiler/control/HookedByTheJit.cpp Outdated Show resolved Hide resolved
runtime/compiler/control/HookedByTheJit.cpp Outdated Show resolved Hide resolved
@dsouzai
Contributor Author

dsouzai commented Feb 26, 2024

@mpirvu good for review again.

@mpirvu mpirvu self-assigned this Feb 29, 2024
@dsouzai
Contributor Author

dsouzai commented Mar 4, 2024

@mpirvu good for review again.

Contributor

@mpirvu mpirvu left a comment

LGTM

@mpirvu
Contributor

mpirvu commented Mar 4, 2024

jenkins test sanity all jdk17

@dsouzai
Contributor Author

dsouzai commented Mar 4, 2024

AIX failure due to #8625:

[2024-03-04T16:49:04.498Z] === Output from failing command(s) repeated here ===
[2024-03-04T16:49:04.498Z] * For target buildtools_tools_jigsaw_classes__the.BUILD_JIGSAW_TOOLS_batch:
[2024-03-04T16:49:04.498Z] Sjavac server failed to initialize: Deadlock condition if locked
[2024-03-04T16:49:04.498Z] Process output:
[2024-03-04T16:49:04.498Z] <End of process output>
[2024-03-04T16:49:04.498Z] IOException caught during compilation: Server failed to initialize: Deadlock condition if locked

@dsouzai
Contributor Author

dsouzai commented Mar 5, 2024

The failed tests on aarch64 and s390x are cmdLineTester_criu_jitPostRestore under -XX:+DebugOnRestore -Xjit:count=0, so I'll need to look into that.

@dsouzai
Contributor Author

dsouzai commented Mar 5, 2024

Probably related to the fact that the recompilations are triggered by the sampler thread with the comp monitor in hand, and so because the comp monitor was already in hand, the sampler thread doesn't call exit() enough times before waiting on the queue slot monitor. Will need to investigate to confirm this.

@dsouzai dsouzai marked this pull request as draft March 5, 2024 16:49
@dsouzai
Contributor Author

dsouzai commented Mar 5, 2024

The test is failing because on restore, the restoring thread calls triggerCompilationOfFailedCompilationsPreCheckpoint with the comp monitor in hand, which eventually (through a series of calls) calls compileOnSeparateThread, which also acquires the comp monitor at the start. In a sync compile, compileOnSeparateThread releases the comp monitor once and waits on the queue slot monitor, but because the monitor's entry count is still 1 after that single release, the restoring thread still owns it, causing the deadlock.
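The entry-count problem can be illustrated with a toy model of a re-entrant monitor. This is only a sketch: `ToyMonitor` and `compThreadCanEnterAfterNestedCall` are hypothetical names, the model tracks nothing but owner and count, and it is not the J9 monitor implementation.

```cpp
#include <cassert>

// Toy model of a re-entrant monitor: it records only the owning thread id
// and how many times that thread has entered.
class ToyMonitor {
public:
    static constexpr int NoOwner = -1;

    bool tryEnter(int threadId) {
        if (_owner == NoOwner) { _owner = threadId; _count = 1; return true; }
        if (_owner == threadId) { ++_count; return true; }
        return false; // held by another thread
    }

    void exit(int threadId) {
        assert(_owner == threadId && _count > 0);
        if (--_count == 0) _owner = NoOwner; // released only when the count reaches 0
    }

    int entryCount() const { return _count; }

private:
    int _owner = NoOwner;
    int _count = 0;
};

// Models the scenario: the callee enters the monitor, exits once, and then
// (conceptually) blocks on a different monitor. Returns whether a second
// thread could enter the first monitor at that point.
inline bool compThreadCanEnterAfterNestedCall(bool callerAlreadyHoldsMonitor) {
    const int restoringThread = 1;
    const int compThread = 2;
    ToyMonitor compMonitor;
    if (callerAlreadyHoldsMonitor)
        compMonitor.tryEnter(restoringThread); // caller: count becomes 1
    compMonitor.tryEnter(restoringThread);     // callee enters at its start
    compMonitor.exit(restoringThread);         // callee exits once before waiting
    return compMonitor.tryEnter(compThread);   // can the compilation thread proceed?
}
```

With `callerAlreadyHoldsMonitor` set to true the function returns false: the single exit leaves the count at 1, so the compilation thread can never enter, while the restoring thread waits for that same compilation thread to finish.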

I spoke to Marius offline about this, and I think I'm going to need to rearchitect some of the infra, especially when considering future PRs to support Debug on Restore. Converting this PR to a draft until I get the infra work done.

@dsouzai dsouzai force-pushed the fsdRecompPostRestore branch 2 times, most recently from 73f3c6d to eb223cd Compare March 28, 2024 19:57
@dsouzai dsouzai marked this pull request as ready for review March 28, 2024 19:57
@dsouzai
Contributor Author

dsouzai commented Apr 1, 2024

@mpirvu This PR should be good for review now

Still seeing a hang with -XX:+DebugOnRestore -Xjit:count=0 that I can't explain at all...

@dsouzai dsouzai marked this pull request as draft April 1, 2024 21:10
runtime/compiler/runtime/CRRuntime.cpp Show resolved Hide resolved
runtime/compiler/runtime/CRRuntime.cpp Outdated Show resolved Hide resolved
runtime/compiler/runtime/CRRuntime.cpp Outdated Show resolved Hide resolved
@dsouzai dsouzai marked this pull request as ready for review April 5, 2024 13:32
@dsouzai
Contributor Author

dsouzai commented Apr 5, 2024

@mpirvu good for review now; #19260 fixed the hang issue and I've addressed the review comments.

@dsouzai
Contributor Author

dsouzai commented Apr 9, 2024

@mpirvu review reminder.

Contributor

@mpirvu mpirvu left a comment

LGTM

@mpirvu
Contributor

mpirvu commented Apr 9, 2024

jenkins test sanity all jdk17

@mpirvu
Contributor

mpirvu commented Apr 10, 2024

A timeout on aarch64

22:49:05  Testing: Create and Restore Criu Checkpoint Image once - TestDelayedOperations
22:49:05  Test start time: 2024/04/10 02:49:02 Coordinated Universal Time
22:49:05  Running command: bash /home/jenkins/workspace/Test_openjdk17_j9_sanity.functional_aarch64_linux_Personal_testList_0/aqa-tests/TKG/../../jvmtest/functional/cmdLineTests/criu/criuScript.sh /home/jenkins/workspace/Test_openjdk17_j9_sanity.functional_aarch64_linux_Personal_testList_0/aqa-tests/TKG/../../jvmtest/functional/cmdLineTests/criu /home/jenkins/workspace/Test_openjdk17_j9_sanity.functional_aarch64_linux_Personal_testList_0/jdkbinary/j2sdk-image/bin/java " -XX:+DebugOnRestore -Xjit:count=0 " org.openj9.criu.TestDelayedOperations 1 1 false false
22:49:05  Time spent starting: 6 milliseconds
22:54:12  ***[TEST INFO 2024/04/10 02:54:02] ProcessKiller detected a timeout after 300000 milliseconds!***
22:54:12  ***[TEST INFO 2024/04/10 02:54:02] executing /usr/bin/gdb -batch -x /tmp/debugger12621056414577160408.txt bash 1004471***
22:54:12  GDB OUT No shared libraries loaded at this time.
22:54:12  INFO: Running '/usr/bin/gdb' failed with rc = 1
22:54:12  GDB ERR Could not attach to process.  If your uid matches the uid of the target
22:54:12  GDB ERR process, check the setting of /proc/sys/kernel/yama/ptrace_scope, or try
22:54:12  GDB ERR again as the root user.  For more details, see /etc/sysctl.d/10-ptrace.conf
22:54:12  GDB ERR ptrace: Operation not permitted.
22:54:12  GDB ERR /home/jenkins/workspace/Test_openjdk17_j9_sanity.functional_aarch64_linux_Personal_testList_0/aqa-tests/TKG/output_17127138677385/cmdLineTester_criu_nonPortableRestore_10/1004471: No such file or directory.
22:54:12  GDB ERR /tmp/debugger12621056414577160408.txt:2: Error in sourced command file:
22:54:12  GDB ERR The program has no registers now.
22:54:12  
22:54:12  INFO: Sleep for 60000 millis before next capture.
22:55:08  ***[TEST INFO 2024/04/10 02:55:03] executing /usr/bin/gdb -batch -x /tmp/debugger12621056414577160408.txt bash 1004471***
22:55:08  GDB OUT No shared libraries loaded at this time.
22:55:08  INFO: Running '/usr/bin/gdb' failed with rc = 1
22:55:08  GDB ERR Could not attach to process.  If your uid matches the uid of the target
22:55:08  GDB ERR process, check the setting of /proc/sys/kernel/yama/ptrace_scope, or try
22:55:08  GDB ERR again as the root user.  For more details, see /etc/sysctl.d/10-ptrace.conf
22:55:08  GDB ERR ptrace: Operation not permitted.
22:55:08  GDB ERR /home/jenkins/workspace/Test_openjdk17_j9_sanity.functional_aarch64_linux_Personal_testList_0/aqa-tests/TKG/output_17127138677385/cmdLineTester_criu_nonPortableRestore_10/1004471: No such file or directory.
22:55:08  GDB ERR /tmp/debugger12621056414577160408.txt:2: Error in sourced command file:
22:55:08  GDB ERR The program has no registers now.

On x86 there are many failures like:

01:47:56  Testing: Envvar test6
01:47:56  Test start time: 2024/04/10 01:47:56 Eastern Standard Time
01:47:56  Running command: bash /home/jenkins/workspace/Test_openjdk17_j9_sanity.functional_x86-64_linux_Personal_testList_1/aqa-tests/TKG/../../jvmtest/functional/cmdLineTests/criu/criuScript.sh /home/jenkins/workspace/Test_openjdk17_j9_sanity.functional_x86-64_linux_Personal_testList_1/aqa-tests/TKG/../../jvmtest/functional/cmdLineTests/criu /home/jenkins/workspace/Test_openjdk17_j9_sanity.functional_x86-64_linux_Personal_testList_1/jdkbinary/j2sdk-image/bin/java " -Xjit -XX:+CRIURestoreNonPortableMode  -Xtrace:print=j9vm.735" org.openj9.criu.EnvVarFileTest EnvVarFileTest6 1 false false
01:47:56  Time spent starting: 7 milliseconds
01:48:00  Time spent executing: 3222 milliseconds
01:48:00  Test result: FAILED
01:48:00  Output from test:
01:48:00   [OUT] start running script
01:48:00   [OUT] export GLIBC_TUNABLES=glibc.cpu.hwcaps=-XSAVEC,-XSAVE,-AVX2,-ERMS,-AVX,-AVX_Fast_Unaligned_Load
01:48:00   [OUT] export LD_BIND_NOT=on
01:48:00   [OUT] /home/jenkins/workspace/Test_openjdk17_j9_sanity.functional_x86-64_linux_Personal_testList_1/jdkbinary/j2sdk-image/bin/java -XX:+EnableCRIUSupport  -Xjit -XX:+CRIURestoreNonPortableMode  -Xtrace:print=j9vm.735 -cp /home/jenkins/workspace/Test_openjdk17_j9_sanity.functional_x86-64_linux_Personal_testList_1/aqa-tests/TKG/../../jvmtest/functional/cmdLineTests/criu/criu.jar org.openj9.criu.EnvVarFileTest EnvVarFileTest6 1
01:48:00   [OUT] WARN: File already exists but should not
01:48:00   [OUT] Pre-checkpoint
01:48:00   [OUT] main: Wed Apr 10 01:47:56 EDT 2024, Performing CRIUSupport.checkpointJVM(), System.currentTimeMillis(): 1712728076866, System.nanoTime(): 29509063139907681
01:48:00   [OUT] initiate restore
01:48:00   [OUT] pie: 14762: Error (criu/pie/restorer.c:1839): prctl failed @1839 with -1
01:48:00   [OUT] pie: 14762: Error (criu/pie/restorer.c:1840): prctl failed @1840 with -1
01:48:00   [OUT] pie: 14762: Error (criu/pie/restorer.c:1841): prctl failed @1841 with -1

@mpirvu
Contributor

mpirvu commented Apr 10, 2024

@dsouzai is the timeout on aarch64 of concern?

@dsouzai
Contributor Author

dsouzai commented Apr 10, 2024

I wouldn't have thought so, but the two hangs happen under -XX:+DebugOnRestore -Xjit:count=0, so will need to look into it I guess.

On shutdown, a compilation is aborted via the
TR::CompilationInterrupted exception. However, the error code associated
with this exception (compilationInterrupted) will result in the
compilation being retried.

In a synchronous compilation, it is possible for a thread to remain
waiting on the queue slot monitor; it would only be notified if the
compilation was not retried. Essentially, the following is possible in a
synchronous compilation:

* Requesting Thread adds method to be compiled
* Requesting Thread waits on the queue slot monitor
* Compilation Thread picks up the entry, and starts compiling
* Shutdown Thread starts shutdown process which:
  * Interrupts compilations in progress
  * Sets the _compilationThreadState to COMPTHREAD_SIGNAL_TERMINATE
* Compilation Thread aborts compilation via TR::CompilationInterrupted
* Compilation Thread calls shouldRetryCompilation which:
  * Returns true for compilationInterrupted
* Compilation Thread requeues entry, does not notify waiting Requesting
  Thread
* Compilation Thread exits loop in
  TR::CompilationInfoPerThread::processEntries
* Requesting Thread remains waiting on the queue slot monitor

This commit updates shouldRetryCompilation to return false if the JVM is
shutting down.
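The decision described above can be sketched as follows. The enums and function names here (`CompResult`, `Action`, `shouldRetry`, `afterCompilation`) are simplified, hypothetical stand-ins; the real shouldRetryCompilation handles many more error codes and takes different parameters.

```cpp
#include <cassert>

// Hypothetical, reduced set of compilation outcomes.
enum class CompResult { Success, Interrupted, Failed };

// What the compilation thread does with the queue entry afterwards.
enum class Action { NotifyRequester, RequeueWithoutNotify };

// Retry policy: interrupted compilations are normally retried, but never
// while the JVM is shutting down.
inline bool shouldRetry(CompResult result, bool shuttingDown) {
    if (shuttingDown)
        return false;
    return result == CompResult::Interrupted;
}

// A retried entry is requeued and the waiting requester is not notified;
// otherwise the requester is woken up.
inline Action afterCompilation(CompResult result, bool shuttingDown) {
    return shouldRetry(result, shuttingDown) ? Action::RequeueWithoutNotify
                                             : Action::NotifyRequester;
}

// Usage: the two cases from the scenario in the commit message.
inline void checkShutdownBehaviour() {
    // Normal operation: an interrupted compilation is requeued.
    assert(afterCompilation(CompResult::Interrupted, false) == Action::RequeueWithoutNotify);
    // Shutdown: no retry, so the requesting thread is notified and can stop waiting.
    assert(afterCompilation(CompResult::Interrupted, true) == Action::NotifyRequester);
}
```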

Signed-off-by: Irwin D'Souza <[email protected]>
Post-restore, recompile all methods that were compiled using FSD to
ensure steady state throughput is not impacted by the code quality of FSD
code. Additionally, attempt compilation of all methods that failed first
time compilation, as the failure is more than likely due to the
constraints imposed by FSD compilation (e.g., JNI methods).
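One way to picture this bookkeeping is a pair of lists filled in before the checkpoint and drained after restore. This is a sketch under assumptions: `RestoreBookkeeping`, `Request`, and the method names are hypothetical and do not reflect the actual CRRuntime interface.

```cpp
#include <cassert>
#include <string>
#include <vector>

// A compilation request produced after restore.
struct Request {
    std::string method;
    bool isRecompile; // true: replace an FSD body; false: retry a failed first compile
};

class RestoreBookkeeping {
public:
    // Pre-checkpoint: remember methods compiled with FSD.
    void rememberFSDBody(const std::string &method) {
        assert(!method.empty());
        _fsdBodies.push_back(method);
    }

    // Pre-checkpoint: remember methods whose first compilation failed.
    void rememberFailedCompile(const std::string &method) {
        assert(!method.empty());
        _failedCompiles.push_back(method);
    }

    // Post-restore: produce one request per remembered method and clear
    // both lists so that nothing is requested twice.
    std::vector<Request> drainPostRestore() {
        std::vector<Request> requests;
        for (const std::string &m : _fsdBodies)
            requests.push_back(Request{m, true});
        for (const std::string &m : _failedCompiles)
            requests.push_back(Request{m, false});
        _fsdBodies.clear();
        _failedCompiles.clear();
        return requests;
    }

private:
    std::vector<std::string> _fsdBodies;
    std::vector<std::string> _failedCompiles;
};
```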

Signed-off-by: Irwin D'Souza <[email protected]>
…estore

Disable sample based recompilation pre-checkpoint when
-XX:+DebugOnRestore is specified as it does not provide any benefit and
can complicate the infrastructure.

Signed-off-by: Irwin D'Souza <[email protected]>
@dsouzai
Contributor Author

dsouzai commented Apr 11, 2024

The latest force push closed quite a few holes, such as:

  • Remove methods from various memoized lists in Post Restore Options processing when methods are filtered
  • Purge memoized lists at shutdown
  • Don't remember failed compilation during shutdown
  • Don't retry compilation during shutdown

Ultimately, the aarch64 issue was due to the CRRuntime Thread never being notified at shutdown.

@@ -2412,6 +2430,10 @@ void jitClassesRedefined(J9VMThread * currentThread, UDATA classCount, J9JITRede
TR_ASSERT(0,"JIT HCR should make all methods recompilable, so startPC=%p should have a persistentBodyInfo", startPC);
}
}

#if defined(J9VM_OPT_CRIU_SUPPORT)
compInfo->getCRRuntime()->removeMethodsFromMemoizedCompilations<J9Method>(staleMethod);
Contributor Author

Because classes are redefined in place, I'm not 100% sure if the stale J9Methods point to the redefined class, or if they've been updated, so I used the J9Method variant of removeMethodsFromMemoizedCompilations here.
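To show why matching on the method itself is the safer choice when the owner pointer may be out of date, here is a sketch of a removal helper templated on the key type. `ToyClass`, `ToyMethod`, `matches`, and `removeMemoized` are hypothetical and only mimic the idea of a per-key-type variant; the actual removeMethodsFromMemoizedCompilations may be implemented quite differently.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>
#include <vector>

// Hypothetical stand-ins for J9Class and J9Method.
struct ToyClass { int id; };
struct ToyMethod { int id; const ToyClass *owner; };

// The matching rule is chosen by overload on the key type.
inline bool matches(const ToyMethod *entry, const ToyMethod *key) { return entry == key; }
inline bool matches(const ToyMethod *entry, const ToyClass *key) { return entry->owner == key; }

// Removes every memoized entry that matches `key` and returns the number removed.
template <typename Key>
std::size_t removeMemoized(std::vector<const ToyMethod *> &list, const Key *key) {
    assert(key != nullptr);
    auto newEnd = std::remove_if(list.begin(), list.end(),
                                 [key](const ToyMethod *entry) { return matches(entry, key); });
    const std::size_t removed = static_cast<std::size_t>(std::distance(newEnd, list.end()));
    list.erase(newEnd, list.end());
    return removed;
}
```

Removing by method identity works regardless of which class the entry's `owner` field points to, whereas removing by class silently misses an entry whose owner pointer refers to a different class object.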

@dsouzai
Contributor Author

dsouzai commented Apr 11, 2024

jenkins test sanity alinux64 jdk17

@dsouzai
Contributor Author

dsouzai commented Apr 11, 2024

@mpirvu good for review again

@mpirvu
Contributor

mpirvu commented Apr 11, 2024

jenkins test sanity all jdk21

@dsouzai
Contributor Author

dsouzai commented Apr 12, 2024

x86 failures seem to be the prctl failed @1839 with -1 issue; should finally be good for merging 🙌🏾

@mpirvu
Contributor

mpirvu commented Apr 12, 2024

xlinux had a bunch of CRIU related failures with

[OUT] pie: 11648: Error (criu/pie/restorer.c:1839): prctl failed @1839 with -1

This is a known issue, so I am going to merge this PR.

@mpirvu mpirvu merged commit 40dcb30 into eclipse-openj9:master Apr 12, 2024
19 of 21 checks passed
Labels
comp:jit criu Used to track CRIU snapshot related work
2 participants