[jdk8] SPECjvm 2008 tests won't run #3733
The attached files are only the global log file (for each of 2 separate processes: not sure why 2 were posted, with different naming schemes?): all the thread logfiles are missing. Both global logs do not show any kind of DR-detected error or DR-aware process exit: did the process exit? If so, what was the exit code? Why did it exit? The output above just ends in the "get_memory_info mismatch" line -- what happens after that? Is it still running? Please describe the full symptoms as it is not clear what is happening.
I don't see much memory at all being used before the end of the log file based on the reduction in free blocks inside DR's vmm:
The two log files are for two different runs. I can look for the thread log files and send them.
The application stops after 60 sec (used -s 60). Java PID => 65189. Attached all log files in the zipped folder. I tried to grep for errors but there are no error messages. I stopped the app after 60 sec because it runs for a long time and doesn't start the warm-up phase of the app (mentioned in the previous comment). The output above just ends in the "get_memory_info mismatch" line -- this is the output for the first 60 sec. There is no output message on the console for another few minutes.
-loglevel 3 is expected to be slow. I believe your 60-second time killed the debug run before it hit any errors. The process logs look normal: no errors, just truncated. The debug run needs a longer timeout. java.65189.00000000.zip contains an empty directory. If it seems hung, the typical approach is to attach a debugger and get callstacks.
The log directory tarball is ~250MB. Please suggest how to share it.
Hi. We are trying to investigate crashes inside the JVM under DynamoRIO. https://groups.google.com/u/0/g/dynamorio-users/c/hSgenyAM5gM
I've also tried to reuse the arrays example from #2989
Could you tell me whether there are techniques to minimize DynamoRIO optimization (like disabling tracing, or anything else)? How could I collect logs while keeping the instrumentation alive? Could you share some debugging tricks here? Thanks, Kirill
I would try (probably all at once):
See https://dynamorio.org/page_debugging.html. Use the debug build and logging https://dynamorio.org/page_logging.html to help see what is going on. Try to diff a passing and a crashing arrays run using logging: if the app is deterministic enough you can do a direct control-flow comparison and then a machine-state comparison at branch divergence points.
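The log-diff idea above can be sketched as follows. This is only an illustrative script, not a DR tool: it assumes the logs contain one `d_r_dispatch: target = 0x...` line per dispatched block (a pattern that does appear in the log excerpts later in this thread); adapt the regex to whatever your actual -loglevel output looks like.

```python
import re

# One dispatch line per block entry, as seen in DR debug logs.
DISPATCH_RE = re.compile(r"d_r_dispatch: target = (0x[0-9a-f]+)")

def dispatch_targets(log_lines):
    """Extract the sequence of dispatched block targets from a DR log."""
    return [m.group(1) for line in log_lines
            if (m := DISPATCH_RE.search(line))]

def first_divergence(good_log, bad_log):
    """Return (index, good_target, bad_target) of the first control-flow
    mismatch, or None if one trace is a prefix of the other."""
    good = dispatch_targets(good_log)
    bad = dispatch_targets(bad_log)
    for i, (g, b) in enumerate(zip(good, bad)):
        if g != b:
            return i, g, b
    return None

good = ["d_r_dispatch: target = 0x1000", "d_r_dispatch: target = 0x2000"]
bad = ["d_r_dispatch: target = 0x1000", "d_r_dispatch: target = 0x3000"]
assert first_divergence(good, bad) == (1, "0x2000", "0x3000")
```

Once the first divergence index is known, the machine-state comparison can then be focused on the log records around that point.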
Hi, Derek.
Thanks, Kirill
A local variable holding a signal context in DR's signal handler or functions it calls. This would only be relevant if inside the signal handling code and wanting to look at the interrupted context.
Hi,
after pausing the process just before the crash (I inserted dr_sleep() in the DynamoRIO code right after the log-printing string), I attached to the process using the hsdb debugger to have JVM-related heap addresses annotated with JVM internal information. Could you share some BKMs or ideas about how to unwind execution back from the faulting instruction to find the register-context divergence point, and then hypothesize about the possible reasons for the divergence? It seems we have control over DynamoRIO interpretation and translation at runtime, so the next step is to learn how to extract the failure reasons from the logs and the memory of the JVM process using hsdb. Thanks,
As mentioned, the best case is to have one successful run to that point to compare against, with no non-determinism: then the first branch divergence point is found, and the machine state is dumped at that point. Having the final mangled in-cache instruction list plus the machine state helps narrow things down. If there is no true code modification (i.e., the JIT is always appending and never re-writing memory that had one instance of code with a new instance), I would try the options from above about disabling the cache consistency/selfmod handling to try to isolate which aspect of DR's handling has the problem. I would look at code marked as "selfmod" (writable), which has a lot of instrumentation added to detect true self-modifying code: there could be bugs lurking in there. Other than cache consistency, other possible areas where bugs could lurk:
Running on AArch64, if easy to do, could be a way to try to compare with different cache consistency schemes.
Thanks for the advice. We will check through the list. I appreciate it very much.
Code transformations performed by DR itself (as opposed to a tool/client) to maintain control are called "mangling". They occur last, just prior to emitting into the code cache.
OK, thanks!
Hi, Answering some of the questions above:
Meanwhile we managed to reproduce the translation failure below on the SPECjvm2008 compress benchmark:

<Resetting caches and non-persistent memory @ 720191 fragments in application /opt/openjdk8u/jvm/openjdk-1.8.0-internal-debug/bin/java (2727674).>

The DynamoRIO log around the place of failure is like this:

d_r_dispatch: target = 0x00007f32112e1743

The issue happens at JIT-compiled code which is like this:

0x00007f32112e173e: test %rax,%rax
;; B2: # N33 <- B1 Freq: 1 IDom: 1/#3 RegPressure: 1 IHRP Index: 2 FRegPressure: 0 FHRP Index: 2
0x00007f32112e1743: add $0x20,%rsp
0x00007f32112e1756: mov $0xffffff65,%esi

The computed target read 0x00007f3468c58000 looks like global data in the JVM:

Suspicious logging from DynamoRIO looks like this:
Any ideas on why the cache pc is 4 bytes away from the faulting instruction, whereas the distance between the instructions in the generated JIT code is 3 bytes? Below is an example, from the same log, of a successful translation of JIT code accessing the same faulting address:

Entry into F529662(0x00007f321127a40a).0x00007f3425bd144d (shared)
master_signal_handler: thread=2727675, sig=11, xsp=0x00007f3224ffb9b8, retaddr=0x00007f3468f395ab
-------- indirect branch target entry: --------
Going to receive signal now
Exit from asynch event

Original JIT code:

0x00007f321127a409: hlt ;*synchronization entry
0x00007f321127a40a: cmp %edx,%esi
0x00007f321127a411: add $0x10,%rsp
DR should handle it just fine. It does require a local spilled register for far-away addresses. Maybe there could be fencepost errors on reachability or scratch-register issues, but rip-rel accesses are quite common, so you would think such problems would not be limited to Java.
I would run with -no_enable_reset to eliminate the complexity of resets.
Translating state inside selfmod-sandboxed code is definitely an area where there could well be bugs.
That is likely the bug causing the problems (or one of them): leading to incorrect register restoration or something.
So the case where the same address does successfully translate has it as a regular (non-selfmod) fragment, right? So it increasingly looks like a selfmod translation problem. I would create a tiny app with this precise block that modifies code on the same page (or, put it on the stack) so it gets marked selfmod, and see if you can repro the translation problem and then easily run it repeatedly (b/c it's so small) w/ increasing diagnostics/logging/debugging.
Does "local spilled register" mean: a) preserve a register value on the stack or somewhere in memory, b) put the far address into the register, c) implement equivalent instructions using the value addressed via the register, d) restore the original value of the register?
BTW, we spotted one more failure that happens after SIGUSR2 translation, resulting in an "Exit due to proactive reset" message in the DR logs. Could you please elaborate on what a reset means in the DynamoRIO context and what it does? References to docs clarifying this term are appreciated as well.

Entry into F329048(0x00007fdcc1252a62).0x00007fded6d565ed (shared)
save_fpstate
d_r_dispatch: target = 0x00007fdcc1252a62
Entry into F329048(0x00007fdcc1252a62).0x00007fded6d565ed (shared)
fcache_enter = 0x00007fded5bbed00, target = 0x00007fded6d565ed
master_signal_handler: thread=2655048, sig=11, xsp=0x00007fdcd60d89b8, retaddr=0x00007fdf1a0585ab
computing memory target for 0x00007fded6d565f2 causing SIGSEGV, kernel claims it is 0x0000000000000293
The failed code address is not the same, but the instructions at those two (failed and succeeded) code addresses are quite similar. Physically they are different pieces of code, in different JIT-compiled methods, but both implement the same logic of checking a value at the same global address located in the JVM (static code implemented in C++, libjvm.so).
Yep, makes sense. Following that approach already. Thanks!!!
Also xref using JVM annotations to avoid DR having to worry about true self-modifying code and having to use complex instrumentation to handle code changes on must-remain-writable pages: #3502, which has some experimental code from an academic paper that was never merged into the main branch.
Yes, but the app stack is of course unsafe to use: thread-local storage via segment ref.
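The four steps a)–d) above, with the TLS spill slot mentioned here, can be modeled roughly as below. This is only a toy simulation (register file, TLS, and memory as dicts; all names invented), not DR's actual mangling code, but it shows the spill/materialize/use/restore shape of the transformation.

```python
# Toy model of mangling a far-away (e.g. rip-relative) access with a
# spilled scratch register: spill to thread-local storage, materialize
# the far address in the register, access through it, then restore.
# All names here are invented for illustration.

FAR_ADDRESS = 0x7F3468C58000  # too far for a 32-bit displacement

def mangled_access(regs, tls, memory):
    tls["spill_slot"] = regs["rax"]       # a) preserve the register in TLS
    regs["rax"] = FAR_ADDRESS             # b) put the far address in it
    value = memory[regs["rax"]]           # c) access via the register
    regs["rax"] = tls.pop("spill_slot")   # d) restore the original value
    return value

regs = {"rax": 0x1234}
memory = {FAR_ADDRESS: 42}
assert mangled_access(regs, {}, memory) == 42
assert regs["rax"] == 0x1234  # app-visible register state is unchanged
```

The key property the model shows is that after the mangled sequence the app-visible register state is identical to what it was before, which is exactly what state translation has to reconstruct if a fault lands in the middle of such a sequence.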
No, on whether it will reach or not (it has to be figured out ahead of time, before placement in the code cache at the final location). But I doubt there are bugs relating to rip-rel.
"Reset" is an internal DR feature where it deletes all memory that can be re-created later: mostly code caches and associated heap. Kind of a garbage collection. Used to save memory and throw out cold code. I don't think there are any docs outside of the code itself.
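The reset idea described above can be pictured as a cache whose every entry is re-creatable from the app code, so the whole thing may be thrown away at any time. This is only an analogy sketch with invented names, not DR's implementation:

```python
# Analogy for DR's "reset": a code cache whose blocks can always be
# rebuilt from the application code, so the entire cache may be dropped
# to reclaim memory; execution simply re-builds blocks lazily on demand.
class RegenerableCodeCache:
    def __init__(self, build_block):
        self._build = build_block   # rebuilds a block from its app pc
        self._cache = {}

    def lookup(self, app_pc):
        if app_pc not in self._cache:        # miss: (re)build the block
            self._cache[app_pc] = self._build(app_pc)
        return self._cache[app_pc]

    def reset(self):
        self._cache.clear()  # safe: everything can be rebuilt on demand

cache = RegenerableCodeCache(lambda pc: f"block@{pc:#x}")
assert cache.lookup(0x1000) == "block@0x1000"
cache.reset()                                   # memory reclaimed
assert cache.lookup(0x1000) == "block@0x1000"   # rebuilt transparently
```

The cost of a reset in this picture is purely the re-build work for code that is still hot, which is why -no_enable_reset is a reasonable simplification while debugging.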
We observed memory corruption in a compiled nmethod (data and code) that a JVM compiler thread (static code) tries to install from the stack into the JVM code cache on the heap. To do that installation, the thread calls libc memcpy() and executes the following code. %rsi points to the stack, %rdi points to the heap.

0x7fcc635737ec: vmovdqu (%rsi),%ymm0

The instruction at 0x7fcc63573817 causes multiple SIGSEGVs under DynamoRIO and the instruction restarts again and again. At some point that restarting gets two SIGUSR2 signals in a row (the JVM employs SIGUSR2 by default to suspend/resume threads), and that ends up zeroing the vector registers in the hardware context of the thread, so the compiled code becomes corrupted while being copied.

Before handling the chained SIGUSR2 signals:
d_r_dispatch: target = 0x00007fcc63573817

After handling the chained SIGUSR2 signals:
d_r_dispatch: target = 0x00007fcc63573817

Part of the log around handling the SIGUSR2 signals follows:

d_r_dispatch: target = 0x00007fcc63573817
d48b4800f0e483487e6aba49
master_signal_handler: thread=2868191, sig=12, xsp=0x00007fca23b009b8, retaddr=0x00007fcc63b1c5ab
save_fpstate
d_r_dispatch: target = 0x00007fcc63573817

Log of handling the SEGV follows:

Entry into F171512(0x00007fcc63573817).0x00007fcc202661f4 (shared)
fcache_enter = 0x00007fcc1f682d00, target = 0x00007fcc202661f4
master_signal_handler: thread=2868191, sig=11, xsp=0x00007fca23b009b8, retaddr=0x00007fcc63b1c5ab
building bb instrlist now ********************* interp: start_pc = 0x00007fcc63573817
setting cur_pc (for fall-through) to 0x00007fcc63573846
bb ilist after mangling:
done building bb instrlist *********************
vm_area_remove_fragment: entry 0x00007fca24864668
executable areas:
thread areas:
Exit from fragment via code mod
d_r_dispatch: target = 0x00007fcc63573817
Did not look in detail at logs but two thoughts:
There was a bug fixed recently where DR would incorrectly nest a signal when the app did not set SA_NODEFER: #4998. Maybe worth testing with that fix if the issue seems to involve DR nesting and the app not handling nesting. Second thought: DR uses SIGUSR2 for suspending threads. Maybe try swapping that to SIGUSR1, just to see if the issue involves a bug in DR's attempt to separate its own use from the app's. If neither of those: is the issue an app SIMD state discrepancy between signal queueing and delivery?
It makes sense if the fix is in a build newer than DynamoRIO-Linux-8.0.18611-1. Is it?
It looks so. Before the SIGUSR2s' delivery the app SIMD state is valid, and after that it is nullified. The diff in handling the second SIGUSR2 looks like this:

handle_suspend_signal: suspended now
save_fpstate

BTW, is it expected that a store instruction accessing heap memory causes multiple SEGVs? I suppose it is a method to detect selfmod code?
It is in 8.0.18824
Try a non-itimer-associated signal? Thinking of #5017.
That looks like DR treating a SIGUSR2 as coming from itself rather than the app.
I would again disable reset to keep things simple. A reset will use SIGUSR2 to suspend all the threads.
A store accessing code-containing memory: yes. DR's invariant is that all code is either read-only or sandboxed, so it keeps it read-only and handles the fault on a write by the app. There is a threshold of fault instances at which point it will bail on using page protection and switch to sandboxing.
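The policy described above can be sketched as a fault counter per code region. The threshold value and all names here are invented for illustration; DR's real logic lives in its cache-consistency code, but the shape of the decision is this:

```python
# Sketch of the described policy: keep code pages read-only, count write
# faults from the app, and past a threshold give up on page protection
# and mark the region for sandboxing instead. Names and the threshold
# value are hypothetical.
SANDBOX_THRESHOLD = 10  # invented number of faults before switching

class CodeRegion:
    def __init__(self):
        self.write_faults = 0
        self.sandboxed = False

    def on_write_fault(self):
        """Handle one app write hitting a read-only code page."""
        if self.sandboxed:
            return "sandboxed-write"          # no fault path anymore
        self.write_faults += 1
        if self.write_faults >= SANDBOX_THRESHOLD:
            self.sandboxed = True             # bail on page protection
            return "switched-to-sandboxing"
        return "flush-and-remap"              # normal consistency flush

region = CodeRegion()
results = [region.on_write_fault() for _ in range(11)]
assert results[0] == "flush-and-remap"
assert results[9] == "switched-to-sandboxing"
assert results[10] == "sandboxed-write"
```

This is why a hot store into code-containing memory (like the memcpy above) can legitimately produce many SEGVs before DR changes strategy for that region.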
With this build, that memory corruption has not reproduced since yesterday. Continuing to reproduce other issues.
Clarified that in our scenario the JVM doesn't use this signal either.
Correct. The logs say that some other app thread, translating a futex syscall, sends these two SIGUSR2s in a row.
OK, clear. The JVM generates a lot of dynamic code regions. Some of those regions can then be moved around in memory as well as patched in place (relocations, optimizations like https://en.wikipedia.org/wiki/Inline_caching). Explicit notification of the DynamoRIO framework by the JVM about its dynamic code layout changes could possibly simplify the DynamoRIO implementation for some tricky corner cases, as well as reduce runtime overhead in general (though this should be measured).
Yes, we had an academic paper on this, and a branch in the code base, but there have not been resources to merge the branch into the mainline. See https://dynamorio.org/page_jitopt.html
Hi @derekbruening.
FAKE_TAG should never be the target of execution: the target_delete IBL entry is used on IBL removal; a special handler handles NULL. The tag_fragment field is the app pc, not the code cache pc, so it is never an execution target. So this does not make sense to me. Ideally we could enable LBR and read the last N branches from within gdb to see exactly how -1 was reached.
Did you mean gdb could set up and read the LBR MSRs? Or how does gdb read them?
Try to reproduce with debug logs to understand the fragment chain.
Still could not reproduce the hang in debug. No fragment was linked with F204780(0x0000ffff9eba6e40).
Ran the same workload under gdb with breakpoints in pthread_mutex_lock. Tried to catch the monitor region 1000001 times; all stores had loads.
Not reproducing in debug reminds me of some tests that are hanging in release on AArch64 but never hang in debug: #4928, e.g. We were going to try to figure that out soon; maybe we can get lucky and it will be the same underlying problem as here. |
For your logs in #3733 (comment) the explanation is this line:
So DR suspended a thread in between the ldaxr and stxr and redirected it to start executing at a new block that tail-duplicates the original. So dynamically there was a ldaxr before the stxr; DR just made a split block for the suspend-and-relocate. |
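The ldaxr/stxr pair discussed here is a load-exclusive/store-exclusive retry loop. A rough Python model of the exclusive monitor (all names invented, simplified to one tagged address) shows why a suspend between the two instructions forces a retry rather than a wrong result:

```python
class ExclusiveMonitor:
    """Toy model of an AArch64 exclusive monitor for a single address."""
    def __init__(self):
        self.tagged = None

    def load_exclusive(self, mem, addr):
        """ldaxr: read the value and tag the address as exclusive."""
        self.tagged = addr
        return mem[addr]

    def store_exclusive(self, mem, addr, value):
        """stxr: succeed (status 0) only if the tag is still held."""
        if self.tagged == addr:
            mem[addr] = value
            self.tagged = None
            return 0
        return 1  # status 1 = failure; the app loop retries

    def clear(self):
        """Context switch / thread suspend clears the monitor."""
        self.tagged = None

mem = {0x100: 5}
mon = ExclusiveMonitor()
v = mon.load_exclusive(mem, 0x100)
mon.clear()                           # e.g. DR suspends the thread here
assert mon.store_exclusive(mem, 0x100, v + 1) == 1  # stxr fails, retry
v = mon.load_exclusive(mem, 0x100)
assert mon.store_exclusive(mem, 0x100, v + 1) == 0  # retry succeeds
assert mem[0x100] == 6
```

This is consistent with the explanation above: splitting the block between ldaxr and stxr is dynamically safe as long as the ldaxr genuinely executes before the stxr on each retry.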
Hi, @derekbruening. |
Yes, PR #5367 fixes one hang we found that reproduced in the release build but not debug (just b/c of timing). There are more, though: drcachesim online (#4928) and some code-inspection issues (#2502). Still, it is worth trying with the PR #5367 patch that was just merged to see if that helps these Java apps.
Removed all my workarounds (prohibition of splitting inside the monitor region and so on) and applied this patch. Had one hang in 2000 runs. Previously it was about 2-3 hangs per 100 runs.
Hi.
Kirill
Hi @derekbruening. The java report:
The crash context:
The same SIGSEGV in the DynamoRIO logs:
The register context is the same as java reported.
The basic block and the bb after mangling:
Look at the crashing instruction:
BUT let's look at the previous instruction:
So, the ldrsw instruction must set
Does DynamoRIO do some internal work here? What could be wrong?
Oh, ldrsw is signed, so x2 could be 0xffffffff945d8290.
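This sign-extension can be checked directly: ldrsw loads a 32-bit word and sign-extends it into a 64-bit register, so a value with bit 31 set acquires an all-ones upper half. A quick check using the value from this thread:

```python
# ldrsw loads a 32-bit value and sign-extends it to 64 bits: if bit 31
# is set, the upper 32 bits of the destination register become all ones.
def sign_extend_32_to_64(value32):
    value32 &= 0xFFFFFFFF
    if value32 & 0x80000000:          # negative as a 32-bit integer
        return value32 | 0xFFFFFFFF_00000000
    return value32

# The value from the log above: bit 31 is set, so x2 gets the
# sign-extended form rather than the zero-extended one.
assert sign_extend_32_to_64(0x945D8290) == 0xFFFFFFFF945D8290
assert sign_extend_32_to_64(0x12345678) == 0x0000000012345678
```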
Hi @derekbruening.
The pc in the code cache for the bb looks like
The clean bb and the bb after mangling:
So, the pc When the thread was awake, the dispatcher set the target
Building the new bb:
Looks like we are back at the 1st original instruction. The crash signal context:
Do I understand correctly the following:
Is it possible that we could not restore the context?
You would expect this to be marked as a mangling epilogue. Translation in a mangling epilogue is supposed to target the next instruction and "emulate" the rest of the epilogue, as it is sometimes impossible to undo the app instr and thus returning the being-mangled instr PC for restart is not feasible. This makes it look like that is not done correctly for stolen register mangling on AArch64. I would suggest filing a separate issue to focus on this. |
It would be great to have some workaround here. This crash is reproduced too often on the pool of our JVM workloads. ((
Hi, @derekbruening. Log example:
Hi, @derekbruening |
That is great news. I've sent you an invite for commit privileges so you can create your own branches. Normally we create a new temporary branch for each PR. |
@kuhanov I was curious if you are still planning to contribute your changes |
Hi. In general we switched to the drcachesim collector. It is more stable and provides offline data for analysis.
But weren't all the issues you hit and the fixes you were going to contribute relating to the core of DR and so would be present in the drcachesim drmemtrace tracer too? |
OK. I looked in our branch and investigated what we have for core and ext in our local branch. There are 3 types of patches:
fixes:
Workarounds. These are not production solutions, but they unblocked us for collecting data for Java (we had limited resources to investigate deeper).
I suppose we could share these patches; maybe they could be added to the DynamoRIO project backlog. Thanks, Kirill
Please share any bug fixes: otherwise someone else may hit the same problem and spend essentially wasted time re-debugging and re-fixing what is already sitting fixed in a private branch somewhere, which is not a good situation. We ourselves may start running Java in the future and would not want to have to re-discover and re-fix all these things.
OK. I'll prepare review requests so we have the ability to link to them. Or maybe is there a better way to share our patches?
Thank you. I think a PR is good even for the ones labeled workarounds where you're not sure if it's the proper long-term approach. |
https://github.com/DynamoRIO/dynamorio/tree/i3733-jvm-bug-workarounds |
Thanks. At a quick glance we have 8 changes: https://github.com/DynamoRIO/dynamorio/tree/i3733-jvm-bug-workarounds
We are seeing that SPECjvm 2008 runs won't even start the warm-up phase when launched with drrun. Typically SPECjvm runs look like this:
With drrun we never get to this first message. I do see two threads running for a short period, but I'm not convinced the run is successful since it never gets to the warm-up and execution phases of the test, although memory utilization is roughly 11GB, which is quite high for sparse.small.
Attached is the loglevel-3 debug log for the java pid:
java.log.zip
java.0.59824.zip