CRIU TR::CompilationInfo::prepareForCheckpoint causes hang intermittently at single thread mode #15191

Closed
JasonFengJ9 opened this issue Jun 1, 2022 · 21 comments · Fixed by #15202
Labels: comp:jit, criu (Used to track CRIU snapshot related work)

Comments

@JasonFengJ9
Member

While working on a testcase for "Prevent contended monitor enters at CRIU single thread mode", the test hangs intermittently, and the culprit thread has the following stack trace:

#0 futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x7f6fe00016d4) at ../sysdeps/nptl/futex-internal.h:183
#1 __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x7f707c07c1b8, cond=0x7f6fe00016a8) at pthread_cond_wait.c:508
#2 __pthread_cond_wait (cond=0x7f6fe00016a8, mutex=0x7f707c07c1b8) at pthread_cond_wait.c:638
#3 0x00007f70828c6ac2 in monitor_wait_original (self=0x7f6fe0001238, monitor=0x7f707c07c138, millis=0, nanos=0, interruptible=0)
  at /root/jdk11-criustm-0531/openj9-openjdk-jdk11/omr/thread/common/omrthread.c:4686
#4 0x00007f70828c5673 in monitor_wait (monitor=0x7f707c07c138, millis=0, nanos=0, interruptible=0) at /root/jdk11-criustm-0531/openj9-openjdk-jdk11/omr/thread/common/omrthread.c:4531
#5 0x00007f70828c5554 in omrthread_monitor_wait (monitor=0x7f707c07c138) at /root/jdk11-criustm-0531/openj9-openjdk-jdk11/omr/thread/common/omrthread.c:4401
#6 0x00007f70815b0928 in J9::Monitor::wait (this=0x7f707c0dc6e0) at /root/jdk11-criustm-0531/openj9-openjdk-jdk11/openj9/runtime/compiler/infra/J9Monitor.cpp:79
#7 0x00007f70814956f2 in TR::CompilationInfo::waitOnCRMonitor (this=0x7f707c0db810) at /root/jdk11-criustm-0531/openj9-openjdk-jdk11/openj9/runtime/compiler/control/CompilationThread.cpp:1461
#8 0x00007f7081498132 in TR::CompilationInfo::prepareForCheckpoint (this=0x7f707c0db810) at /root/jdk11-criustm-0531/openj9-openjdk-jdk11/openj9/runtime/compiler/control/CompilationThread.cpp:2665
#9 0x00007f70814bf11e in jitHookPrepareCheckpoint (hookInterface=0x7f707c011ba8, eventNum=101, eventData=0x7f705c4081f0, userData=0x0)
  at /root/jdk11-criustm-0531/openj9-openjdk-jdk11/openj9/runtime/compiler/control/HookedByTheJit.cpp:1840
#10 0x00007f70828d96d2 in J9HookDispatch (hookInterface=0x7f707c011ba8, taggedEventNum=101, eventData=0x7f705c4081f0) at /root/jdk11-criustm-0531/openj9-openjdk-jdk11/omr/util/hookable/hookable.cpp:235
#11 0x00007f70829d6659 in jvmCheckpointHooks (currentThread=0x19d300) at /root/jdk11-criustm-0531/openj9-openjdk-jdk11/openj9/runtime/vm/CRIUHelpers.cpp:66
#12 0x00007f705c3c717f in Java_org_eclipse_openj9_criu_CRIUSupport_checkpointJVMImpl (env=0x19d300, unused=0x1fb630, imagesDir=0x209fa0, leaveRunning=0 '\000', shellJob=1 '\001', extUnixSupport=0 '\000',
  logLevel=0, logFile=0x0, fileLocks=1 '\001', workDir=0x209f68, tcpEstablished=1 '\001', autoDedup=0 '\000', trackMemory=0 '\000', unprivileged=0 '\000')
  at /root/jdk11-criustm-0531/openj9-openjdk-jdk11/openj9/runtime/criusupport/criusupport.cpp:389
#13 0x00007f7082bde60e in ffi_call_unix64 () at /root/jdk11-criustm-0531/openj9-openjdk-jdk11/openj9/runtime/libffi/x86/unix64.S:105
#14 0x00007f7082bdde7f in ffi_call_int (cif=0x7f705c408860, fn=0x7f70828b1330, rvalue=0x19d420, avalue=0x7f705c408900, closure=0x0)
  at /root/jdk11-criustm-0531/openj9-openjdk-jdk11/openj9/runtime/libffi/x86/ffi64.c:672
#15 0x00007f7082bddf00 in ffi_call (cif=0x7f705c408860, fn=0x7f70828b1330, rvalue=0x19d420, avalue=0x7f705c408900) at /root/jdk11-criustm-0531/openj9-openjdk-jdk11/openj9/runtime/libffi/x86/ffi64.c:691
#16 0x00007f70829e7efb in VM_BytecodeInterpreterCompressed::cJNICallout (this=0x7f705c4091b0, _sp=@0x7f705c409120: 0x209f20, _pc=@0x7f705c409128: 0x7 <error: Cannot access memory at address 0x7>,
  receiverAddress=0x1fb630, javaArgs=0x209fa8, returnType=0x7f705c408af3 "", returnStorage=0x19d420, function=0x7f70828b1330, isStatic=true)
  at /root/jdk11-criustm-0531/openj9-openjdk-jdk11/openj9/runtime/vm/BytecodeInterpreter.hpp:2508
#17 0x00007f70829e78f7 in VM_BytecodeInterpreterCompressed::callCFunction (this=0x7f705c4091b0, _sp=@0x7f705c409120: 0x209f20, _pc=@0x7f705c409128: 0x7 <error: Cannot access memory at address 0x7>,
  jniMethodStartAddress=0x7f70828b1330, receiverAddress=0x1fb630, javaArgs=0x209fa0, bp=0x7f705c408b00, isStatic=true, returnType=0x7f705c408af3 "")
  at /root/jdk11-criustm-0531/openj9-openjdk-jdk11/openj9/runtime/vm/BytecodeInterpreter.hpp:2326
#18 0x00007f70829e7140 in VM_BytecodeInterpreterCompressed::runJNINative (this=0x7f705c4091b0, _sp=@0x7f705c409120: 0x209f20, _pc=@0x7f705c409128: 0x7 <error: Cannot access memory at address 0x7>)
  at /root/jdk11-criustm-0531/openj9-openjdk-jdk11/openj9/runtime/vm/BytecodeInterpreter.hpp:2218
#19 0x00007f7082a04213 in VM_BytecodeInterpreterCompressed::run (this=0x7f705c4091b0, vmThread=0x19d300) at /root/jdk11-criustm-0531/openj9-openjdk-jdk11/openj9/runtime/vm/BytecodeInterpreter.hpp:10557
#20 0x00007f70829d7f79 in bytecodeLoopCompressed (currentThread=0x19d300) at /root/jdk11-criustm-0531/openj9-openjdk-jdk11/openj9/runtime/vm/BytecodeInterpreter.inc:112
#21 0x00007f7082b203e2 in c_cInterpreter () at /root/jdk11-criustm-0531/openj9-openjdk-jdk11/build/linux-x86_64-normal-server-release/vm/runtime/vm/xcinterp.s:168
#22 0x00007f70828fe6ff in runJavaThread (currentThread=0x19d300) at /root/jdk11-criustm-0531/openj9-openjdk-jdk11/openj9/runtime/vm/callin.cpp:656
#23 0x00007f70829d3ddc in javaProtectedThreadProc (portLibrary=0x7f7082cecee0 <j9portLibrary>, entryarg=0x19d300) at /root/jdk11-criustm-0531/openj9-openjdk-jdk11/openj9/runtime/vm/vmthread.cpp:2081
#24 0x00007f70826fb7f3 in omrsig_protect (portLibrary=0x7f7082cecee0 <j9portLibrary>, fn=0x7f70829d3c93 <javaProtectedThreadProc(J9PortLibrary*, void*)>, fn_arg=0x19d300,
  handler=0x7f708293d810 <structuredSignalHandler>, handler_arg=0x19d300, flags=506, result=0x7f705c409db0) at /root/jdk11-criustm-0531/openj9-openjdk-jdk11/omr/port/unix/omrsignal.c:425
#25 0x00007f70829cf17b in javaThreadProc (entryarg=0x7f707c00f690) at /root/jdk11-criustm-0531/openj9-openjdk-jdk11/openj9/runtime/vm/vmthread.cpp:360
#26 0x00007f70828c0c09 in thread_wrapper (arg=0x7f6fe0001238) at /root/jdk11-criustm-0531/openj9-openjdk-jdk11/omr/thread/common/omrthread.c:1724
#27 0x00007f7082d56609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#28 0x00007f7082eb2163 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

fyi @dsouzai @tajila

@JasonFengJ9 JasonFengJ9 added the criu Used to track CRIU snapshot related work label Jun 1, 2022
@tajila tajila added the comp:jit label Jun 1, 2022
@tajila
Contributor

tajila commented Jun 1, 2022

@dsouzai please take a look at this

@dsouzai
Contributor

dsouzai commented Jun 1, 2022

@JasonFengJ9 do you mind sending me the diagnostic info (and the jdk)?

@JasonFengJ9
Member Author

@dsouzai pls find attached test.zip
The hang can be reproduced w/ a recent criu build

java -XX:+EnableCRIUSupport --add-exports java.base/openj9.internal.criu=ALL-UNNAMED -Denable.j9internal.checkpoint.hook.api.debug TestSingleThreadMode

It is an intermittent problem, ~1/30.

@dsouzai
Contributor

dsouzai commented Jun 1, 2022

I was able to reproduce a hang, but I don't see a thread waiting in TR::CompilationInfo::waitOnCRMonitor. I see this:

2LKMONINUSE      sys_mon_t:0x00007F234C417998 infl_mon_t: 0x00007F234C417A18:
3LKMONOBJECT       java/lang/Object@0x000000070629C688: owner "Thread-491" (J9VMThread:0x0000000000A3A700), entry count 1
3LKWAITERQ            Waiting to enter:
3LKWAITER                "Thread-3" (J9VMThread:0x0000000000195D00)
3LKWAITER                "Thread-4" (J9VMThread:0x00000000001C2700)
3LKWAITER                "Thread-6" (J9VMThread:0x00000000001CA000)
3LKWAITER                "Thread-7" (J9VMThread:0x00000000001D8D00)
3LKWAITER                "Thread-8" (J9VMThread:0x00000000001C3300)
3LKWAITER                "Thread-9" (J9VMThread:0x00000000001C3F00)
3LKWAITER                "Thread-11" (J9VMThread:0x00000000001C5700)
3LKWAITER                "Thread-12" (J9VMThread:0x00000000001ECB00)
3LKWAITER                "Thread-13" (J9VMThread:0x00000000001F0700)
3LKWAITER                "Thread-14" (J9VMThread:0x00000000001F4400)
3LKWAITER                "Thread-15" (J9VMThread:0x00000000001F8000)
3LKWAITER                "Thread-16" (J9VMThread:0x00000000001FBC00)
3LKWAITER                "Thread-17" (J9VMThread:0x00000000001FF900)
3LKWAITER                "Thread-18" (J9VMThread:0x0000000000203500)
3LKWAITER                "Thread-19" (J9VMThread:0x0000000000207100)
3LKWAITER                "Thread-20" (J9VMThread:0x000000000020AE00)
3LKWAITER                "Thread-21" (J9VMThread:0x000000000020EA00)
3LKWAITER                "Thread-23" (J9VMThread:0x0000000000216300)
...
2LKREGMON          Thread public flags mutex lock (0x00007F234CADE608): <unowned>
3LKNOTIFYQ            Waiting to be notified:
3LKWAITNOTIFY            "Thread-491" (J9VMThread:0x0000000000A3A700)

I have a couple of questions:

  1. What exactly does "single thread mode" mean, and what does that mean in terms of things like vmaccess?
  2. @JasonFengJ9, in the coredump you looked at where it was hanging, what do the stack traces of the comp threads look like? When the hook thread is waiting on the CRMonitor, it's waiting to be notified by any active comp threads so looking at what the comp threads are doing might provide some more information.

As it stands, it doesn't look like it's the CR Monitor to blame, but I'll try to reproduce some more to see.

@JasonFengJ9
Member Author

What exactly does "single thread mode" mean, and what does that mean in terms of things like vmaccess?

In this mode, only the thread performing the checkpoint is expected to run (jvmCheckpointHooks, jvmRestoreHooks) until the mode is exited. It acquires and releases vmaccess as usual. During that period, this thread is the only Java thread running, so it must not invoke monitor_wait: doing so would relinquish ownership and wait for a notification from another Java thread, which will never arrive, and the JVM would hang.

in the coredump you looked at where it was hanging

It was from attaching to a live process; I can try reproducing and getting the stack traces of the JIT threads.

@dsouzai
Contributor

dsouzai commented Jun 2, 2022

Annoyingly enough, I can't reproduce this when I build the JVM locally (even with the same OMR, OpenJ9, and extensions repo SHAs).

I think I do see one problem in the JIT that can manifest itself as the thread waiting on the CR Monitor. If there are comp threads that are "active" but waiting on work, they wait on the Comp Monitor. When the hook thread tries to suspend all the comp threads, it doesn't send a notify on the Comp Monitor, so it's possible for it to wait forever while the "active" comp threads sit waiting to be notified to compile something (which never happens because no java code is running anymore).

I'll open a PR to fix that, which I believe is the problem seen in the stack trace in this issue. However, there might be a second issue because in my env I don't ever see anyone waiting on the CR Monitor. I do see that one of the comp threads is waiting to be notified on the comp monitor, but it doesn't look like we even get to the jit hook where it would hang waiting on the CR monitor.
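The missed notification described above can be modelled in a few lines of standard C++. This is only a sketch under stated assumptions, not the OpenJ9 code or the change in the PR: `CompMonitorModel`, `idleCompThread`, and `suspendForCheckpoint` are hypothetical names, a single `std::mutex` stands in for both monitors, and the two condition variables loosely play the roles of the Comp Monitor and the CR Monitor.

```cpp
#include <cassert>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <thread>

// Hypothetical model of the state guarded by the compilation monitor.
struct CompMonitorModel
   {
   std::mutex              lock;
   std::condition_variable compCv;   // idle comp threads sleep here (Comp Monitor role)
   std::condition_variable crCv;     // hook thread sleeps here (CR Monitor role)
   bool workAvailable    = false;
   bool suspendRequested = false;
   bool suspended        = false;
   };

// An "active" comp thread with nothing to compile: it sleeps until notified.
void idleCompThread(CompMonitorModel &m)
   {
   std::unique_lock<std::mutex> guard(m.lock);
   m.compCv.wait(guard, [&] { return m.workAvailable || m.suspendRequested; });
   if (m.suspendRequested)
      {
      m.suspended = true;
      m.crCv.notify_all();   // tell the hook thread this comp thread has parked
      }
   }

// Hook thread: request suspension, wake any sleepers, then wait for the ack.
bool suspendForCheckpoint(CompMonitorModel &m)
   {
   std::unique_lock<std::mutex> guard(m.lock);
   m.suspendRequested = true;
   m.compCv.notify_all();    // without this, a sleeping comp thread never rechecks the flag
   m.crCv.wait(guard, [&] { return m.suspended; });
   return m.suspended;
   }

// Runs one idle comp thread against the hook thread and reports success.
bool runMissedNotifyDemo()
   {
   CompMonitorModel m;
   std::thread comp(idleCompThread, std::ref(m));
   bool ok = suspendForCheckpoint(m);
   comp.join();
   assert(m.suspended);
   return ok;
   }
```

Calling `runMissedNotifyDemo()` returns `true` once the idle thread has acknowledged the suspend request. If the `m.compCv.notify_all()` line is removed and the comp thread is already asleep, the hook thread waits forever, which matches the symptom in the original stack trace.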

@dsouzai
Contributor

dsouzai commented Jun 2, 2022

Got lucky and managed to get a hang with my local build, which contains the fix for the thread waiting on the comp monitor. It doesn't look like we ever got to the JIT hook.

I'm now more convinced there are two problems; I don't think the second problem is caused by the JIT.

@dsouzai
Contributor

dsouzai commented Jun 2, 2022

Opened #15202 to fix the JIT issue.

@JasonFengJ9
Member Author

I'm now more convinced there are two problems; I don't think the second problem is caused by the JIT.

The testcase intends to create monitor contention leading to a hang, which #14651 is expected to detect and turn into a RestoreException instead. However, it can't prevent the hang because of this issue.
For some reason, my environment can't reproduce a hang w/ the JIT stack trace in question; glad @dsouzai found the cause.

@dsouzai
Contributor

dsouzai commented Jun 2, 2022

Ah I see, so should I just have #15202 close this issue automatically?

@JasonFengJ9
Member Author

so should I just have #15202 close this issue automatically?

Yes, pls do.

@dsouzai
Contributor

dsouzai commented Jun 2, 2022

Based on #15191 (comment) I decided to pull in the changes from #14651. With those changes, I'm able to reproduce the original problem.

While the issue of the comp thread waiting on the comp monitor is, I believe, valid, the actual problem here seems to be that a comp thread is compiling and is blocked waiting for exclusive vm access:

#0  __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7f5eac0d178c) at futex-internal.c:57
#1  __futex_abstimed_wait_common (futex_word=futex_word@entry=0x7f5eac0d178c, expected=expected@entry=0, clockid=clockid@entry=0, abstime=abstime@entry=0x0, private=private@entry=0, cancel=cancel@entry=true) at futex-internal.c:87
#2  0x00007f5eb242579f in __GI___futex_abstimed_wait_cancelable64 (futex_word=futex_word@entry=0x7f5eac0d178c, expected=expected@entry=0, clockid=clockid@entry=0, abstime=abstime@entry=0x0, private=private@entry=0) at futex-internal.c:139
#3  0x00007f5eb2427eb0 in __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x7f5eac006b48, cond=0x7f5eac0d1760) at pthread_cond_wait.c:504
#4  ___pthread_cond_wait (cond=cond@entry=0x7f5eac0d1760, mutex=mutex@entry=0x7f5eac006b48) at pthread_cond_wait.c:619
#5  0x00007f5eb20b57a7 in monitor_wait_original (interruptible=0, nanos=0, millis=0, monitor=0x7f5eac006ac8, self=0x7f5eac0d12f0) at /root/hostdir/openj9-openjdk-jdk11/omr/thread/common/omrthread.c:4686
#6  monitor_wait (interruptible=0, nanos=0, millis=0, monitor=0x7f5eac006ac8) at /root/hostdir/openj9-openjdk-jdk11/omr/thread/common/omrthread.c:4531
#7  omrthread_monitor_wait (monitor=0x7f5eac006ac8) at /root/hostdir/openj9-openjdk-jdk11/omr/thread/common/omrthread.c:4401
#8  0x00007f5eb214ee24 in acquireExclusiveVMAccess (vmThread=0x2ba00) at /root/hostdir/openj9-openjdk-jdk11/openj9/runtime/vm/VMAccess.cpp:303
#9  0x00007f5eb1b20121 in jit_artifact_protected_add_code_cache (vm=0x7f5eac00f250, tree=0x7f5eac0bdcb0, cacheToInsert=0x7f5e7000c390, optionalHashTable=0x0) at /root/hostdir/openj9-openjdk-jdk11/openj9/runtime/jit_vm/artifact.c:86
#10 0x00007f5eb1a60b57 in OMR::CodeCacheManager::allocateCodeCacheFromNewSegment (this=0x7f5eac0c43a0, segmentSizeInBytes=<optimized out>, reservingCompilationTID=reservingCompilationTID@entry=4) at /root/hostdir/openj9-openjdk-jdk11/omr/compiler/runtime/OMRCodeCacheManager.cpp:1211
#11 0x00007f5eb1a60daf in OMR::CodeCacheManager::reserveCodeCache (this=0x7f5eac0c43a0, compilationCodeAllocationsMustBeContiguous=compilationCodeAllocationsMustBeContiguous@entry=false, sizeEstimate=sizeEstimate@entry=0, compThreadID=compThreadID@entry=4, numReserved=numReserved@entry=0x7f5e809819cc) at /root/hostdir/openj9-openjdk-jdk11/omr/compiler/runtime/OMRCodeCacheManager.cpp:356
#12 0x00007f5eb15e9f05 in J9::CodeCacheManager::reserveCodeCache (this=0x7f5eac0c43a0, compilationCodeAllocationsMustBeContiguous=compilationCodeAllocationsMustBeContiguous@entry=false, sizeEstimate=sizeEstimate@entry=0, compThreadID=compThreadID@entry=4, numReserved=numReserved@entry=0x7f5e809819cc) at /root/hostdir/openj9-openjdk-jdk11/openj9/runtime/compiler/runtime/J9CodeCacheManager.cpp:223
#13 0x00007f5eb1409e36 in TR_J9VMBase::getDesignatedCodeCache (this=0x7f5eac058ff0, comp=0x7f5ddf000000) at /root/hostdir/openj9-openjdk-jdk11/openj9/runtime/compiler/env/VMJ9.cpp:5523
#14 0x00007f5eb1368879 in J9::CodeGenerator::reserveCodeCache (this=0x7f5ddf005790) at /root/hostdir/openj9-openjdk-jdk11/openj9/runtime/compiler/codegen/J9CodeGenerator.cpp:4968
#15 0x00007f5eb168d519 in OMR::CodeGenPhase::performAll (this=this@entry=0x7f5ddf005bc0) at /root/hostdir/openj9-openjdk-jdk11/omr/compiler/codegen/OMRCodeGenPhase.cpp:138
#16 0x00007f5eb168a163 in OMR::CodeGenerator::generateCode (this=0x7f5ddf005790) at /root/hostdir/openj9-openjdk-jdk11/omr/compiler/codegen/OMRCodeGenerator.cpp:1406
#17 0x00007f5eb16b7689 in OMR::Compilation::compile (this=this@entry=0x7f5ddf000000) at /root/hostdir/openj9-openjdk-jdk11/omr/compiler/compile/OMRCompilation.cpp:1111
#18 0x00007f5eb13a9fda in TR::CompilationInfoPerThreadBase::compile (this=this@entry=0x7f5eb030cb40, vmThread=vmThread@entry=0x2ba00, compiler=0x7f5ddf000000, compilee=compilee@entry=0x7f5e80985988, vm=..., optimizationPlan=<optimized out>, scratchSegmentProvider=...) at /root/hostdir/openj9-openjdk-jdk11/openj9/runtime/compiler/control/CompilationThread.cpp:9587

I'm not really sure what a good way forward is. We do want to have the ability to compile before performing a checkpoint but I guess jit_artifact_protected_add_code_cache needs exclusive vmaccess, which I guess some other thread holds (or at least some thread still has vmaccess) when we're in single threaded mode.

@tajila @JasonFengJ9 do you guys have any suggestions?

@dsouzai
Contributor

dsouzai commented Jun 2, 2022

Do you think it's safe for the thread that's hooking into the jit to release vmaccess while it waits on the various monitors, and then reacquire it right before we return from the hook (assuming it's the hook thread that has vmaccess)?

@dsouzai
Contributor

dsouzai commented Jun 2, 2022

With this change:

diff --git a/runtime/compiler/control/CompilationThread.cpp b/runtime/compiler/control/CompilationThread.cpp
index 036574704..abbbda5bd 100644
--- a/runtime/compiler/control/CompilationThread.cpp
+++ b/runtime/compiler/control/CompilationThread.cpp
@@ -2618,6 +2618,10 @@ void TR::CompilationInfo::prepareForCheckpoint()
    J9JavaVM   *vm       = _jitConfig->javaVM;
    J9VMThread *vmThread = vm->internalVMFunctions->currentVMThread(vm);
 
+   bool hadVMAccess = (vmThread->publicFlags & J9_PUBLIC_FLAGS_VM_ACCESS);
+   if (hadVMAccess)
+      releaseVMAccessNoSuspend(vmThread);
+
    {
    OMR::CriticalSection suspendCompThreadsForCheckpoint(getCompilationMonitor());
 
@@ -2682,6 +2686,9 @@ void TR::CompilationInfo::prepareForCheckpoint()
       }
    }
 
+   if (hadVMAccess)
+      acquireVMAccessNoSuspend(vmThread);
+
    }
 
 void TR::CompilationInfo::prepareForRestore()

I didn't see the issue even after running 500 times. I still have a couple of questions though:

  1. Is it possible for any other threads to have vmaccess at this point?
  2. Is this still a reasonable thing to do when we're not in single threaded mode?
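The release-then-reacquire pattern in the diff above can also be written as a scoped guard, so that VM access is restored on every exit path. The sketch below is illustrative only: `FakeVMThread`, `releaseAccess`, `acquireAccess`, and `ReleaseVMAccessScope` are hypothetical stand-ins for `J9VMThread`, `releaseVMAccessNoSuspend`, and `acquireVMAccessNoSuspend`, and the sketch does not model real VM access semantics.

```cpp
#include <cassert>

// Hypothetical stand-in for J9VMThread; it only tracks whether VM access is held.
struct FakeVMThread
   {
   bool hasVMAccess = false;
   };

// Hypothetical helpers standing in for the real release/acquire calls.
void releaseAccess(FakeVMThread &t) { assert(t.hasVMAccess);  t.hasVMAccess = false; }
void acquireAccess(FakeVMThread &t) { assert(!t.hasVMAccess); t.hasVMAccess = true;  }

// Releases VM access on construction if it was held; restores it on destruction.
class ReleaseVMAccessScope
   {
public:
   explicit ReleaseVMAccessScope(FakeVMThread &t)
      : _thread(t), _hadVMAccess(t.hasVMAccess)
      {
      if (_hadVMAccess)
         releaseAccess(_thread);
      }

   ~ReleaseVMAccessScope()
      {
      if (_hadVMAccess)
         acquireAccess(_thread);
      }

   ReleaseVMAccessScope(const ReleaseVMAccessScope &) = delete;
   ReleaseVMAccessScope &operator=(const ReleaseVMAccessScope &) = delete;

private:
   FakeVMThread &_thread;
   bool          _hadVMAccess;
   };

// Reports whether VM access was held while the (simulated) wait ran.
bool accessHeldDuringWait(FakeVMThread &t)
   {
   ReleaseVMAccessScope scope(t);
   return t.hasVMAccess;   // stands in for waiting on the comp threads
   }
```

With a thread that enters holding access, `accessHeldDuringWait` returns `false` and the thread holds access again afterwards. A thread that enters without access is left untouched, mirroring the `hadVMAccess` check in the diff.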

@JasonFengJ9
Member Author

Is it possible for any other threads to have vmaccess at this point?

Single Thread Mode stops all threads that can run Java code; GC/JIT threads are not affected. So yes, it is possible for other threads to hold vmaccess. The DDR command threads flags gives hints about the thread(s) in question.

Is this still a reasonable thing to do when we're not in single threaded mode?

Does prepareForCheckpoint() really need vmaccess? If so, it might not be safe to release it; hypothetically, what if there is a STW GC going on?

@dsouzai
Contributor

dsouzai commented Jun 3, 2022

Does prepareForCheckpoint() really need vmaccess? If so, it might not be safe to release it; hypothetically, what if there is a STW GC going on?

No, it's the opposite. In the code in #15191 (comment) I'm releasing vmaccess in prepareForCheckpoint because the compilation thread (a different thread) needs exclusive vmaccess when it calls jit_artifact_protected_add_code_cache.

Is it possible for any other threads to have vmaccess at this point?

Single Thread Mode stops all threads that can run Java code; GC/JIT threads are not affected. So yes, it is possible for other threads to hold vmaccess. The DDR command threads flags gives hints about the thread(s) in question.

Will they hold on to it indefinitely, or will they release it at some point? Otherwise the compilation thread will block indefinitely waiting for exclusive vmaccess.

Is this still a reasonable thing to do when we're not in single threaded mode?

I was asking whether it is ok for the hook thread (the thread that calls prepareForCheckpoint) to release vmaccess in non-single-threaded mode, or whether it does not matter.

@JasonFengJ9
Member Author

Does prepareForCheckpoint() really need vmaccess? If so, it might not be safe to release it; hypothetically, what if there is a STW GC going on?

No, it's the opposite. In the code in #15191 (comment) I'm releasing vmaccess in prepareForCheckpoint because the compilation thread (a different thread) needs exclusive vmaccess when it calls jit_artifact_protected_add_code_cache.

Just to clarify the question about prepareForCheckpoint(): does this JIT helper manipulate objects, i.e., is vmaccess required for its underlying operations? An attempt by jit_artifact_protected_add_code_cache to acquire exclusive vmaccess probably can't be guaranteed to proceed in single threaded mode (STM).

Will they hold on to it indefinitely, or will they release it at some point? Otherwise the compilation thread will block indefinitely waiting for exclusive vmaccess.

That depends on what the other GC/JIT threads are doing; the compilation thread could be blocked during STM.

I was asking whether it is ok for the hook thread (the thread that calls prepareForCheckpoint) to release vmaccess in non-single-threaded mode, or whether it does not matter.

I'm assuming prepareForCheckpoint is invoked via jvmCheckpointHooks->jitHookPrepareCheckpoint, which occurs only within STM. Since there might be multiple active GC/JIT threads not affected by STM, I think vmaccess needs to be acquired/released as usual, i.e., STM doesn't ensure safe vmaccess for individual GC/JIT threads.

@dsouzai
Contributor

dsouzai commented Jun 3, 2022

So here's what the circumstance looks like:

Comp Thread (C1): compiling some method.

Hook Thread (T1): VM calls jitHookPrepareCheckpoint which calls prepareForCheckpoint:

static void jitHookPrepareCheckpoint(J9HookInterface * * hookInterface, UDATA eventNum, void * eventData, void * userData)
{
J9VMClassesUnloadEvent * restoreEvent = (J9VMClassesUnloadEvent *)eventData;
J9VMThread * vmThread = restoreEvent->currentThread;
J9JavaVM * javaVM = vmThread->javaVM;
J9JITConfig * jitConfig = javaVM->jitConfig;
TR::CompilationInfo * compInfo = TR::CompilationInfo::get(jitConfig);
compInfo->prepareForCheckpoint();
}

T1 may have vmaccess when it's in prepareForCheckpoint; in fact in your test case, it looks like it does because when I release VM access, the test passes without any hangs.

C1 needs exclusive vmaccess at some point during the compilation. However, if any other threads have vmaccess, it's going to have to wait.

When in prepareForCheckpoint, T1 will wait until all the compilation threads have suspended themselves.

C1 is not going to suspend itself until it's done compiling. However, it can't finish compiling until it acquires exclusive vmaccess.

The simple solution I put up is that if T1 has vmaccess, it releases it before waiting for all comp threads to suspend themselves, and then reacquires it before returning back to the VM. The thing I'm not sure about is, is that good enough? The constraint here is, we want to have the ability to compile a bunch of methods before we checkpoint, and as such, we need to allow comp threads to acquire exclusive vmaccess while T1 is waiting in prepareForCheckpoint.

Is my solution of having T1 release vmaccess at the start of jitHookPrepareCheckpoint and reacquire vmaccess at the end of jitHookPrepareCheckpoint OK; namely, are there any assumptions that will be broken if T1 does not always have vmaccess while it's in jitHookPrepareCheckpoint?
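The T1/C1 circular wait described above can be reproduced with a reader-writer lock as a rough model of VM access. This is a deliberately simplified sketch, not how OpenJ9 implements vmaccess: shared ownership stands in for vmaccess, unique ownership stands in for exclusive vmaccess, and `VMAccessModel`, `compThreadC1`, and `hookThreadT1` are hypothetical names. `std::shared_timed_mutex` is used only because it is available from C++14; `std::shared_mutex` behaves the same way here in C++17.

```cpp
#include <atomic>
#include <cassert>
#include <functional>
#include <mutex>
#include <shared_mutex>
#include <thread>

// Hypothetical model: shared lock = vmaccess, unique lock = exclusive vmaccess.
struct VMAccessModel
   {
   std::shared_timed_mutex access;
   std::atomic<bool>       compileDone{false};
   };

// C1: cannot finish compiling until it obtains exclusive access
// (the role jit_artifact_protected_add_code_cache plays in the trace above).
void compThreadC1(VMAccessModel &vm)
   {
   std::unique_lock<std::shared_timed_mutex> exclusive(vm.access);
   vm.compileDone = true;
   }

// T1: enters holding vmaccess and must wait for C1 to finish.
bool hookThreadT1(VMAccessModel &vm)
   {
   std::shared_lock<std::shared_timed_mutex> vmAccess(vm.access);  // held on entry
   std::thread c1(compThreadC1, std::ref(vm));

   vmAccess.unlock();   // the fix: without this, join() below never returns
   c1.join();           // stands in for waiting until all comp threads suspend
   vmAccess.lock();     // reacquire before returning to the VM

   return vm.compileDone.load() && vmAccess.owns_lock();
   }
```

`hookThreadT1` returns `true` because C1 can take the exclusive lock once T1 has dropped its shared lock. Removing the `vmAccess.unlock()` call recreates the hang: C1 blocks on the exclusive lock while T1 blocks in `join()`.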

@tajila
Contributor

tajila commented Jun 3, 2022

T1 may have vmaccess when it's in prepareForCheckpoint; in fact in your test case, it looks like it does because when I release VM access, the test passes without any hangs.

Yes, prepareForCheckpoint is always called with VMAccess. It should be safe to release VMAccess at the start of jitHookPrepareCheckpoint and reacquire it at the end, because at this point we have halted all Java threads.

@tajila
Contributor

tajila commented Jun 3, 2022

Is this only required for jitHookPrepareCheckpoint? I don't think that would work on the restore side.

@dsouzai
Contributor

dsouzai commented Jun 3, 2022

Is this only required for jitHookPrepareCheckpoint? I don't think that would work on the restore side.

Yeah this is only needed on the checkpoint side; on the restore side all we do is resume the comp threads to go back to looking for work. We don't allow a checkpoint while a compilation is in flight.
