[jdk8] SPECjvm 2008 tests won't run #3733
The attached files are only the global log file (for each of 2 separate processes: not sure why 2 were posted, with different naming schemes?): all the thread logfiles are missing. Both global logs do not show any kind of DR-detected error or DR-aware process exit: did the process exit? If so, what was the exit code? Why did it exit? The output above just ends in the "get_memory_info mismatch" line -- what happens after that? Is it still running? Please describe the full symptoms as it is not clear what is happening.
I don't see much memory at all being used before the end of the log file based on the reduction in free blocks inside DR's vmm:
The two log files are for two different runs. I can look for the thread log files and send them.
The application stops after 60 sec (used -s 60). Java PID => 65189. Attached all log files in the zipped folder. I tried to grep for errors but there are no error messages. I stopped the app after 60 sec because it runs for a long time and doesn't start the warm-up phase of the app (mentioned in the previous comment). The output above just ends in the "get_memory_info mismatch" line -- this is the output for the first 60 sec. There is no output message on the console for another few minutes.
-loglevel 3 is expected to be slow. I believe your 60-second time killed the debug run before it hit any errors. The process logs look normal: no errors, just truncated. The debug run needs a longer timeout. java.65189.00000000.zip contains an empty directory. If it seems hung, the typical approach is to attach a debugger and get callstacks.
The log directory tarball is ~250MB. Please suggest how to share it.
Hi. We are trying to investigate crashes inside the JVM under DynamoRIO. https://groups.google.com/u/0/g/dynamorio-users/c/hSgenyAM5gM
I've also tried to reuse the arrays example from #2989
Could you tell me whether there are techniques to minimize DynamoRIO optimization (like disabling tracing, or anything else)? How could I collect logs while keeping the instrumentation alive? Could you share some debugging tricks here? Thanks, Kirill
I would try (probably all at once):
See https://dynamorio.org/page_debugging.html. Use the debug build and logging https://dynamorio.org/page_logging.html to help see what is going on. Try to diff a passing and a crashing arrays run using logging: if the app is deterministic enough you can do a direct control-flow comparison and then a machine-state comparison at branch divergence points.
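The log-diff idea above can be sketched as follows. This is only an illustrative script, not a DR tool: it assumes the logs contain one `d_r_dispatch: target = 0x...` line per dispatched block (a pattern that does appear in the log excerpts later in this thread); adapt the regex to whatever your actual -loglevel output looks like.

```python
import re

# One dispatch line per block entry, as seen in DR debug logs.
DISPATCH_RE = re.compile(r"d_r_dispatch: target = (0x[0-9a-f]+)")

def dispatch_targets(log_lines):
    """Extract the sequence of dispatched block targets from a DR log."""
    return [m.group(1) for line in log_lines
            if (m := DISPATCH_RE.search(line))]

def first_divergence(good_log, bad_log):
    """Return (index, good_target, bad_target) of the first control-flow
    mismatch, or None if one trace is a prefix of the other."""
    good = dispatch_targets(good_log)
    bad = dispatch_targets(bad_log)
    for i, (g, b) in enumerate(zip(good, bad)):
        if g != b:
            return i, g, b
    return None

good = ["d_r_dispatch: target = 0x1000", "d_r_dispatch: target = 0x2000"]
bad = ["d_r_dispatch: target = 0x1000", "d_r_dispatch: target = 0x3000"]
assert first_divergence(good, bad) == (1, "0x2000", "0x3000")
```

Once the first divergence index is known, the machine-state comparison can then be focused on the log records around that point.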
Hi, Derek.
Thanks, Kirill
A local variable holding a signal context in DR's signal handler or functions it calls. This would only be relevant if inside the signal handling code and wanting to look at the interrupted context.
Hi,
after pausing the process just before the crash (I inserted dr_sleep() in the DynamoRIO code right after the log-printing string), I attached to the process using the hsdb debugger to have JVM-related heap addresses annotated with JVM internal information. Could you share some BKMs or ideas about how to unwind execution back from the faulting instruction to find the register-context divergence point, and then hypothesize about the possible reasons for the divergence? It seems we have control over DynamoRIO interpretation and translation at runtime, so the next step is to learn how to extract the failure reasons from the logs and the memory of the JVM process using hsdb. Thanks,
As mentioned, the best case is to have one successful run to that point to compare against, with no non-determinism: then the first branch divergence point is found, and the machine state is dumped at that point. Having the final mangled in-cache instruction list plus the machine state helps narrow things down. If there is no true code modification (i.e., the JIT is always appending and never re-writing memory that had one instance of code with a new instance), I would try the options from above about disabling the cache consistency/selfmod handling to try to isolate which aspect of DR's handling has the problem. I would look at code marked as "selfmod" (writable), which has a lot of instrumentation added to detect true self-modifying code: there could be bugs lurking in there. Other than cache consistency, other possible areas where bugs could lurk:
Running on AArch64, if easy to do, could be a way to try to compare with different cache consistency schemes.
Thanks for the advice. We will check through the list. I appreciate it very much.
Code transformations performed by DR itself (as opposed to a tool/client) to maintain control are called "mangling". They occur last, just prior to emitting into the code cache.
OK, thanks!
Hi, Answering some of the questions above:
Meanwhile we managed to reproduce the translation failure below on the SPECjvm2008 compress benchmark:

<Resetting caches and non-persistent memory @ 720191 fragments in application /opt/openjdk8u/jvm/openjdk-1.8.0-internal-debug/bin/java (2727674).>

The DynamoRIO log around the place of failure is like this:

d_r_dispatch: target = 0x00007f32112e1743

The issue happens at JIT-compiled code which is like this:

0x00007f32112e173e: test %rax,%rax
;; B2: # N33 <- B1 Freq: 1 IDom: 1/#3 RegPressure: 1 IHRP Index: 2 FRegPressure: 0 FHRP Index: 2
0x00007f32112e1743: add $0x20,%rsp
0x00007f32112e1756: mov $0xffffff65,%esi

The computed target read 0x00007f3468c58000 looks like global data in the JVM:

Suspicious logging from DynamoRIO looks like this:
Any ideas on why the cache pc is 4 bytes away from the faulting instruction, whereas the distance between the instructions in the generated JIT code is 3 bytes? Below is an example, from the same log, of a successful translation of JIT code accessing the same faulting address:

Entry into F529662(0x00007f321127a40a).0x00007f3425bd144d (shared)
master_signal_handler: thread=2727675, sig=11, xsp=0x00007f3224ffb9b8, retaddr=0x00007f3468f395ab
-------- indirect branch target entry: --------
Going to receive signal now
Exit from asynch event

Original JIT code:

0x00007f321127a409: hlt ;*synchronization entry
0x00007f321127a40a: cmp %edx,%esi
0x00007f321127a411: add $0x10,%rsp
DR should handle it just fine. It does require a local spilled register for far-away addresses. Maybe there could be fencepost errors on reachability or scratch-register issues, but rip-rel accesses are quite common, so you would think such problems would not be limited to Java.
I would run with -no_enable_reset to eliminate the complexity of resets.
Translating state inside selfmod-sandboxed code is definitely an area where there could well be bugs.
That is likely the bug causing the problems (or one of them): leading to incorrect register restoration or something.
So the case where the same address does successfully translate has it as a regular (non-selfmod) fragment, right? So it increasingly looks like a selfmod translation problem. I would create a tiny app with this precise block that modifies code on the same page (or, put it on the stack) so it gets marked selfmod, and see if you can repro the translation problem and then easily run it repeatedly (b/c it's so small) w/ increasing diagnostics/logging/debugging.
Does "local spilled register" mean: a) preserve a register value on the stack or somewhere in memory, b) put the far address into the register, c) implement equivalent instructions using the value addressed via the register, d) restore the original value of the register?
BTW, we spotted one more failure that happens after SIGUSR2 translation, resulting in an "Exit due to proactive reset" message in the DR logs. Could you please elaborate on what a reset means in the DynamoRIO context and what it does? References to docs clarifying this term are appreciated as well.

Entry into F329048(0x00007fdcc1252a62).0x00007fded6d565ed (shared)
save_fpstate
d_r_dispatch: target = 0x00007fdcc1252a62
Entry into F329048(0x00007fdcc1252a62).0x00007fded6d565ed (shared)
fcache_enter = 0x00007fded5bbed00, target = 0x00007fded6d565ed
master_signal_handler: thread=2655048, sig=11, xsp=0x00007fdcd60d89b8, retaddr=0x00007fdf1a0585ab
computing memory target for 0x00007fded6d565f2 causing SIGSEGV, kernel claims it is 0x0000000000000293
The failed code address is not the same, but the instructions at those two (failed and succeeded) code addresses are quite similar. Physically they are different pieces of code, in different JIT-compiled methods, but both implement the same logic of checking a value at the same global address located in the JVM (static code implemented in C++, libjvm.so).
Yep, makes sense. Following that approach already. Thanks!!!
Also xref using JVM annotations to avoid DR having to worry about true self-modifying code and having to use complex instrumentation to handle code changes on must-remain-writable pages: #3502, which has some experimental code from an academic paper that was never merged into the main branch.
Yes, but the app stack is of course unsafe to use: thread-local storage via segment ref.
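The four steps a)–d) above, with the TLS spill slot mentioned here, can be modeled roughly as below. This is only a toy simulation (register file, TLS, and memory as dicts; all names invented), not DR's actual mangling code, but it shows the spill/materialize/use/restore shape of the transformation.

```python
# Toy model of mangling a far-away (e.g. rip-relative) access with a
# spilled scratch register: spill to thread-local storage, materialize
# the far address in the register, access through it, then restore.
# All names here are invented for illustration.

FAR_ADDRESS = 0x7F3468C58000  # too far for a 32-bit displacement

def mangled_access(regs, tls, memory):
    tls["spill_slot"] = regs["rax"]       # a) preserve the register in TLS
    regs["rax"] = FAR_ADDRESS             # b) put the far address in it
    value = memory[regs["rax"]]           # c) access via the register
    regs["rax"] = tls.pop("spill_slot")   # d) restore the original value
    return value

regs = {"rax": 0x1234}
memory = {FAR_ADDRESS: 42}
assert mangled_access(regs, {}, memory) == 42
assert regs["rax"] == 0x1234  # app-visible register state is unchanged
```

The key property the model shows is that after the mangled sequence the app-visible register state is identical to what it was before, which is exactly what state translation has to reconstruct if a fault lands in the middle of such a sequence.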
No, on whether it will reach or not (it has to be figured out ahead of time, before placement in the code cache at the final location). But I doubt there are bugs relating to rip-rel.
"Reset" is an internal DR feature where it deletes all memory that can be re-created later: mostly code caches and associated heap. Kind of a garbage collection. Used to save memory and throw out cold code. I don't think there are any docs outside of the code itself.
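The reset idea described above can be pictured as a cache whose every entry is re-creatable from the app code, so the whole thing may be thrown away at any time. This is only an analogy sketch with invented names, not DR's implementation:

```python
# Analogy for DR's "reset": a code cache whose blocks can always be
# rebuilt from the application code, so the entire cache may be dropped
# to reclaim memory; execution simply re-builds blocks lazily on demand.
class RegenerableCodeCache:
    def __init__(self, build_block):
        self._build = build_block   # rebuilds a block from its app pc
        self._cache = {}

    def lookup(self, app_pc):
        if app_pc not in self._cache:        # miss: (re)build the block
            self._cache[app_pc] = self._build(app_pc)
        return self._cache[app_pc]

    def reset(self):
        self._cache.clear()  # safe: everything can be rebuilt on demand

cache = RegenerableCodeCache(lambda pc: f"block@{pc:#x}")
assert cache.lookup(0x1000) == "block@0x1000"
cache.reset()                                   # memory reclaimed
assert cache.lookup(0x1000) == "block@0x1000"   # rebuilt transparently
```

The cost of a reset in this picture is purely the re-build work for code that is still hot, which is why -no_enable_reset is a reasonable simplification while debugging.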
We observed memory corruption in a compiled nmethod (data and code) that a JVM compiler thread (static code) tries to install from the stack into the JVM code cache on the heap. To do that installation, the thread calls libc memcpy() and executes the following code. %rsi points to the stack, %rdi points to the heap.

0x7fcc635737ec: vmovdqu (%rsi),%ymm0

The instruction at 0x7fcc63573817 causes multiple SIGSEGVs under DynamoRIO and the instruction restarts again and again. At some point that restarting gets two SIGUSR2 signals in a row (the JVM employs SIGUSR2 by default to suspend/resume threads), and that ends up zeroing the vector registers in the hardware context of the thread, so the compiled code becomes corrupted while being copied.

Before handling the chained SIGUSR2 signals:
d_r_dispatch: target = 0x00007fcc63573817

After handling the chained SIGUSR2 signals:
d_r_dispatch: target = 0x00007fcc63573817

Part of the log around handling the SIGUSR2 signals follows:

d_r_dispatch: target = 0x00007fcc63573817
d48b4800f0e483487e6aba49
master_signal_handler: thread=2868191, sig=12, xsp=0x00007fca23b009b8, retaddr=0x00007fcc63b1c5ab
save_fpstate
d_r_dispatch: target = 0x00007fcc63573817

Log of handling the SEGV follows:

Entry into F171512(0x00007fcc63573817).0x00007fcc202661f4 (shared)
fcache_enter = 0x00007fcc1f682d00, target = 0x00007fcc202661f4
master_signal_handler: thread=2868191, sig=11, xsp=0x00007fca23b009b8, retaddr=0x00007fcc63b1c5ab
building bb instrlist now ********************* interp: start_pc = 0x00007fcc63573817
setting cur_pc (for fall-through) to 0x00007fcc63573846
bb ilist after mangling:
done building bb instrlist *********************
vm_area_remove_fragment: entry 0x00007fca24864668
executable areas:
thread areas:
Exit from fragment via code mod
d_r_dispatch: target = 0x00007fcc63573817
Did not look in detail at logs but two thoughts:
There was a bug fixed recently where DR would incorrectly nest a signal when the app did not set SA_NODEFER: #4998. Maybe worth testing with that fix if the issue seems to involve DR nesting and the app not handling nesting. Second thought: DR uses SIGUSR2 for suspending threads. Maybe try swapping that to SIGUSR1, just to see if the issue involves a bug in DR's attempt to separate its own use from the app's. If neither of those: is the issue an app SIMD state discrepancy between signal queueing and delivery?
It makes sense if the fix is in a build newer than DynamoRIO-Linux-8.0.18611-1. Is it?
It looks so. Before the SIGUSR2s' delivery the app SIMD state is valid, and after that it is nullified. The diff in handling the second SIGUSR2 looks like this:

handle_suspend_signal: suspended now
save_fpstate

BTW, is it expected that a store instruction accessing heap memory causes multiple SEGVs? I suppose it is a method to detect selfmod code?
It is in 8.0.18824
Try a non-itimer-associated signal? Thinking of #5017.
That looks like DR treating a SIGUSR2 as coming from itself rather than the app.
I would again disable reset to keep things simple. A reset will use SIGUSR2 to suspend all the threads.
A store accessing code-containing memory: yes. DR's invariant is that all code is either read-only or sandboxed, so it keeps it read-only and handles the fault on a write by the app. There is a threshold of fault instances at which point it will bail on using page protection and switch to sandboxing.
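The policy described above can be sketched as a fault counter per code region. The threshold value and all names here are invented for illustration; DR's real logic lives in its cache-consistency code, but the shape of the decision is this:

```python
# Sketch of the described policy: keep code pages read-only, count write
# faults from the app, and past a threshold give up on page protection
# and mark the region for sandboxing instead. Names and the threshold
# value are hypothetical.
SANDBOX_THRESHOLD = 10  # invented number of faults before switching

class CodeRegion:
    def __init__(self):
        self.write_faults = 0
        self.sandboxed = False

    def on_write_fault(self):
        """Handle one app write hitting a read-only code page."""
        if self.sandboxed:
            return "sandboxed-write"          # no fault path anymore
        self.write_faults += 1
        if self.write_faults >= SANDBOX_THRESHOLD:
            self.sandboxed = True             # bail on page protection
            return "switched-to-sandboxing"
        return "flush-and-remap"              # normal consistency flush

region = CodeRegion()
results = [region.on_write_fault() for _ in range(11)]
assert results[0] == "flush-and-remap"
assert results[9] == "switched-to-sandboxing"
assert results[10] == "sandboxed-write"
```

This is why a hot store into code-containing memory (like the memcpy above) can legitimately produce many SEGVs before DR changes strategy for that region.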
With this build, that memory corruption has not reproduced since yesterday. Continuing to reproduce other issues.
Clarified that in our scenario the JVM doesn't use this signal either.
Correct. The logs say that some other app thread, translating a futex syscall, sends these two SIGUSR2s in a row.
OK, clear. The JVM generates a lot of dynamic code regions. Some of those regions can then be moved around in memory as well as patched in place (relocations, optimizations like https://en.wikipedia.org/wiki/Inline_caching). Explicit notification of the DynamoRIO framework by the JVM about its dynamic code layout changes could possibly simplify the DynamoRIO implementation for some tricky corner cases, as well as reduce runtime overhead in general (though this should be measured).
Yes, we had an academic paper on this, and a branch in the code base, but there have not been resources to merge the branch into the mainline. See https://dynamorio.org/page_jitopt.html
Hi @derekbruening.
FAKE_TAG should never be the target of execution: the target_delete IBL entry is used on IBL removal; a special handler handles NULL. The tag_fragment field is the app pc, not the code cache pc, so it is never an execution target. So this does not make sense to me. Ideally we could enable LBR and read the last N branches from within gdb to see exactly how -1 was reached.
Did you mean gdb could set up and read the LBR MSRs? Or how does gdb read them?
Try to reproduce with debug logs to understand the fragment chain.
Still could not reproduce the hang in debug. No fragment was linked with F204780(0x0000ffff9eba6e40).
Ran the same workload under gdb with breakpoints in pthread_mutex_lock. Tried to catch the monitor region 1000001 times; all stores had loads.
Not reproducing in debug reminds me of some tests that are hanging in release on AArch64 but never hang in debug: #4928, e.g. We were going to try to figure that out soon; maybe we can get lucky and it will be the same underlying problem as here. |
For your logs in #3733 (comment) the explanation is this line:
So DR suspended a thread in between the ldaxr and stxr and redirected it to start executing at a new block that tail-duplicates the original. So dynamically there was a ldaxr before the stxr; DR just made a split block for the suspend-and-relocate. |
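The ldaxr/stxr pair discussed here is a load-exclusive/store-exclusive retry loop. A rough Python model of the exclusive monitor (all names invented, simplified to one tagged address) shows why a suspend between the two instructions forces a retry rather than a wrong result:

```python
class ExclusiveMonitor:
    """Toy model of an AArch64 exclusive monitor for a single address."""
    def __init__(self):
        self.tagged = None

    def load_exclusive(self, mem, addr):
        """ldaxr: read the value and tag the address as exclusive."""
        self.tagged = addr
        return mem[addr]

    def store_exclusive(self, mem, addr, value):
        """stxr: succeed (status 0) only if the tag is still held."""
        if self.tagged == addr:
            mem[addr] = value
            self.tagged = None
            return 0
        return 1  # status 1 = failure; the app loop retries

    def clear(self):
        """Context switch / thread suspend clears the monitor."""
        self.tagged = None

mem = {0x100: 5}
mon = ExclusiveMonitor()
v = mon.load_exclusive(mem, 0x100)
mon.clear()                           # e.g. DR suspends the thread here
assert mon.store_exclusive(mem, 0x100, v + 1) == 1  # stxr fails, retry
v = mon.load_exclusive(mem, 0x100)
assert mon.store_exclusive(mem, 0x100, v + 1) == 0  # retry succeeds
assert mem[0x100] == 6
```

This is consistent with the explanation above: splitting the block between ldaxr and stxr is dynamically safe as long as the ldaxr genuinely executes before the stxr on each retry.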
Hi, @derekbruening. |
Yes, PR #5367 fixes one hang we found that reproduced in the release build but not debug (just b/c of timing). There are more, though: drcachesim online (#4928) and some code-inspection issues (#2502). Still, it is worth trying with the PR #5367 patch that was just merged to see if that helps these Java apps.
Removed all my workarounds (prohibition of splitting inside the monitor region and so on) and applied this patch. Had one hang in 2000 runs. Previously it was about 2-3 hangs per 100 runs.
Hi.
Kirill
Hi @derekbruening. The java report:
The crash context:
The same SIGSEGV in the DynamoRIO logs:
The register context is the same as java reported.
The basic block and the bb after mangling:
Look at the crashing instruction:
BUT let's look at the previous instruction:
So, the ldrsw instruction must set
Does DynamoRIO do some internal work here? What could be wrong?
Oh, ldrsw is signed, so x2 could be 0xffffffff945d8290.
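This sign-extension can be checked directly: ldrsw loads a 32-bit word and sign-extends it into a 64-bit register, so a value with bit 31 set acquires an all-ones upper half. A quick check using the value from this thread:

```python
# ldrsw loads a 32-bit value and sign-extends it to 64 bits: if bit 31
# is set, the upper 32 bits of the destination register become all ones.
def sign_extend_32_to_64(value32):
    value32 &= 0xFFFFFFFF
    if value32 & 0x80000000:          # negative as a 32-bit integer
        return value32 | 0xFFFFFFFF_00000000
    return value32

# The value from the log above: bit 31 is set, so x2 gets the
# sign-extended form rather than the zero-extended one.
assert sign_extend_32_to_64(0x945D8290) == 0xFFFFFFFF945D8290
assert sign_extend_32_to_64(0x12345678) == 0x0000000012345678
```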
Hi @derekbruening.
The pc in the code cache for the bb looks like
The clean bb and the bb after mangling:
So, the pc When the thread was awake, the dispatcher set the target
Building the new bb:
Looks like we are back at the 1st original instruction. The crash signal context:
Do I understand correctly the following:
Is it possible that we could not restore the context?
You would expect this to be marked as a mangling epilogue. Translation in a mangling epilogue is supposed to target the next instruction and "emulate" the rest of the epilogue, as it is sometimes impossible to undo the app instr and thus returning the being-mangled instr PC for restart is not feasible. This makes it look like that is not done correctly for stolen register mangling on AArch64. I would suggest filing a separate issue to focus on this. |
It would be great to have some workaround here. This crash is reproduced too often on the pool of our JVM workloads. ((
Hi, @derekbruening. Log example:
Hi, @derekbruening |
That is great news. I've sent you an invite for commit privileges so you can create your own branches. Normally we create a new temporary branch for each PR. |
@kuhanov I was curious if you are still planning to contribute your changes |
Hi. In general we switched to the drcachesim collector. It is more stable and provides offline data for analysis.
But weren't all the issues you hit and the fixes you were going to contribute relating to the core of DR and so would be present in the drcachesim drmemtrace tracer too? |
OK. I looked in our branch and investigated what we have for core and ext in our local branch. There are 3 types of patches:
fixes:
Workarounds. These are not production solutions, but they unblocked us for collecting data for Java (we had limited resources to investigate deeper).
I suppose we could share these patches; maybe they could be added to the DynamoRIO project backlog. Thanks, Kirill
Please share any bug fixes: otherwise someone else may hit the same problem and spend essentially wasted time re-debugging and re-fixing what is already sitting fixed in a private branch somewhere, which is not a good situation. We ourselves may start running Java in the future and would not want to have to re-discover and re-fix all these things.
OK. I'll prepare review requests so we have the ability to link to them. Or maybe is there a better way to share our patches?
Thank you. I think a PR is good even for the ones labeled workarounds where you're not sure if it's the proper long-term approach. |
https://github.com/DynamoRIO/dynamorio/tree/i3733-jvm-bug-workarounds |
Thanks. At a quick glance we have 8 changes: https://github.com/DynamoRIO/dynamorio/tree/i3733-jvm-bug-workarounds
We are seeing that SPECjvm 2008 runs won't even start the warm-up phase when launched with drrun. Typically SPECjvm runs look like this:
With drrun we never get to this first message. I do see two threads running for a short period, but I'm not convinced the run is successful since it never gets to the warm-up and execution phases of the test, although memory utilization is roughly 11GB, which is quite high for sparse.small.
Attached is the loglevel-3 debug log for the java pid:
java.log.zip
java.0.59824.zip