support detach on 32-bit ARM (AArch64 finished) #1578

Open
derekbruening opened this issue Dec 30, 2014 · 9 comments

Comments

@derekbruening
Contributor

To determine whether a thread has exited, I'm storing a value in the app TLS slot we're using. To properly detach we'll need to instead restore that value. This issue covers coming up with some other solution to the problem of figuring out whether a thread has exited, or else delaying the restore of that value on detach.
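
A minimal sketch of the conflict described above, not DynamoRIO's actual code. It assumes one reading of the comment: an exit marker written into the borrowed slot and a restored app value cannot both live in that slot, so one option is to keep the exit state out of band. The struct, the `EXITED_SENTINEL` value, and all function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical marker value; the real one is not given in this thread. */
#define EXITED_SENTINEL ((uintptr_t)-1)

/* Hypothetical per-thread record, for illustration only. */
typedef struct {
    uintptr_t app_tls_slot;    /* stands in for the app TLS slot being borrowed */
    uintptr_t saved_app_value; /* what the app had in the slot before takeover */
    bool exited;               /* out-of-band flag kept outside the app slot */
} thread_record_t;

/* Takeover: remember the app's value, then place the tool's value in the slot. */
void take_over(thread_record_t *t, uintptr_t tool_value)
{
    t->saved_app_value = t->app_tls_slot;
    t->app_tls_slot = tool_value;
    t->exited = false;
}

/* In-slot scheme: the marker overwrites the slot, so the app's value is not
 * visible there afterwards. */
void mark_exited_in_slot(thread_record_t *t)
{
    t->app_tls_slot = EXITED_SENTINEL;
}

bool has_exited_by_slot(const thread_record_t *t)
{
    return t->app_tls_slot == EXITED_SENTINEL;
}

/* Out-of-band scheme: restore the app's value and record the exit elsewhere. */
void detach_and_restore(thread_record_t *t)
{
    assert(!t->exited); /* this sketch expects a single detach per thread */
    t->app_tls_slot = t->saved_app_value;
    t->exited = true;
}

bool has_exited_by_flag(const thread_record_t *t)
{
    return t->exited;
}
```

With the out-of-band flag the slot holds the app's original value after detach while the exit is still observable. Where such a flag could live is the open question: a later comment in this thread suggests the client TLS area, which does not cover static DR.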

derekbruening added a commit that referenced this issue Mar 25, 2020
Adds an alternative scheme for achieving a post-call control point
that does not require flushing or shared data structure examination
per-call: replacing the return address with a sentinel.

When the new flag DRWRAP_REPLACE_RETADDR is set, the return address is
replaced with the address of a single return instruction in the client
library, with the real address saved.  When a block is seen consisting
of that sentinel instruction, post-call callbacks are called, and then
control is sent to the saved real address using
dr_redirect_native_target().

Adds wrapping tests to drwrap-test.

This new scheme requires restoring return addresses on the stack on
detach or other state translation.  Adds functionality to do so, along
with a new test client.drwrap-test-detach.

This requires the client's state restoration event be called for
addresses not in the code cache.  Adds such a call.

Adds comments about translation problems with clean call mangling
which is filed as i#4219.  The issues seen here are all limited to
traces, so the test works around the problems with -disable_traces.

Tested the core drwrap behavior on ARM and AArch64 but missing general
detach support there (#1578) prevents enabling the detach test there.

Issue: #4219
Fixes #4197
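
The sentinel scheme in the commit message above can be illustrated with a small standalone model. This is a sketch under stated assumptions, not drwrap's implementation: the stack is modeled as plain integers, `SENTINEL_ADDR` is a made-up address, and every type and function name here is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_WRAPPED 16
/* Hypothetical address of the single return instruction used as the sentinel. */
#define SENTINEL_ADDR ((uintptr_t)0xdead0000u)

/* One replaced return address: where it lives and what it really was. */
typedef struct {
    uintptr_t *slot;
    uintptr_t real;
} saved_ret_t;

typedef struct {
    saved_ret_t entries[MAX_WRAPPED];
    size_t count;
} retaddr_log_t;

/* At a wrapped call: save the real return address and install the sentinel.
 * Returns 1 on success, 0 if the log is full. */
int replace_retaddr(retaddr_log_t *saved, uintptr_t *slot)
{
    if (saved->count >= MAX_WRAPPED)
        return 0;
    saved->entries[saved->count].slot = slot;
    saved->entries[saved->count].real = *slot;
    saved->count++;
    *slot = SENTINEL_ADDR;
    return 1;
}

/* When execution reaches the sentinel: post-call callbacks would run here,
 * then control continues at the returned real address. */
uintptr_t on_sentinel_reached(retaddr_log_t *saved)
{
    assert(saved->count > 0);
    saved->count--;
    return saved->entries[saved->count].real;
}

/* On detach: put every outstanding real return address back on the stack. */
void restore_all_on_detach(retaddr_log_t *saved)
{
    while (saved->count > 0) {
        saved->count--;
        *saved->entries[saved->count].slot = saved->entries[saved->count].real;
    }
}
```

The restore step is why this scheme depends on detach support: any sentinel left on the stack after detach would send the native app to an address it does not expect.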
@derekbruening
Contributor Author

If we only support detach for -private_loader we could put a flag in the
client TLS area, which we allocate in the same mmap but place prior to the
TCB? I guess the problem is static DR, which is a major use case.

@derekbruening
Contributor Author

The other detach issue I'm seeing is a crash due to not setting xsp in notify_and_jmp_without_stack. There's even a comment there that "xsp is only set for X86" but no reason is given. We have to set it to get the frame for sigreturn at the right spot. I don't really understand that comment, given that it has no TODO or FIXME.

@derekbruening
Contributor Author

After fixing the setting of xsp, I'm seeing something strange: immediately past the sigreturn everything is good, but as soon as we execute the SYS_futex we returned to for __pthread_cond_wait, the thread state is messed up. This causes a SIGSEGV natively, and under gdb it messes up gdb such that it's unusable:

(gdb) x/4i 0xffffb7a882c8
   0xffffb7a882c8 <__pthread_cond_wait+312>:	svc	#0x0
   0xffffb7a882cc <__pthread_cond_wait+316>:	mov	w19, #0x0                   	// #0
   0xffffb7a882d0 <__pthread_cond_wait+320>:	ldr	w0, [x29,#160]
   0xffffb7a882d4 <__pthread_cond_wait+324>:	bl	0xffffb7a8b3c8 <__pthread_disable_asynccancel>
(gdb) x/4gx 0xffffb792f8d0
0xffffb792f8d0:	0x0000ffffb792f990	0x0000aaaaaaaac244
0xffffb792f8e0:	0x0000ffffb7930200	0x0000fffffffff3c8
(gdb) b * 0xffffb7a882c8
Breakpoint 2 at 0xffffb7a882c8: file pthread_cond_wait.c, line 186.
(gdb) c
Continuing.

Thread 2 "api.detach" hit Breakpoint 2, 0x0000ffffb7a882c8 in __pthread_cond_wait (cond=0xaaaaaaac7080, mutex=0xaaaaaaac70b0) at pthread_cond_wait.c:186
186	pthread_cond_wait.c: No such file or directory.
1: x/i $pc
=> 0xffffb7a882c8 <__pthread_cond_wait+312>:	svc	#0x0
2: /x $sp = 0xffffb792f8d0
(gdb) bt
#0  0x0000ffffb7a882c8 in __pthread_cond_wait (cond=0xaaaaaaac7080, mutex=0xaaaaaaac70b0) at pthread_cond_wait.c:186
#1  0x0000aaaaaaaac244 in wait_cond_var (var=0xaaaaaaac7080) at /home/derek/dr/src/suite/tests/condvar.h:134
#2  0x0000aaaaaaaac4d4 in sideline_spinner (arg=0x0) at /home/derek/dr/src/suite/tests/api/detach.c:147
#3  0x0000ffffb7a820a0 in start_thread (arg=0xaaaaaaaac414 <sideline_spinner>) at pthread_create.c:335
#4  0x0000ffffb79f8eac in thread_start () at ../sysdeps/unix/sysv/linux/aarch64/clone.S:77
(gdb) stepi
thread_get_info_callback: cannot get thread info: generic error
(gdb) info reg
Selected thread is running.
(gdb) p 0x62
$3 = 98
(gdb) info all-registers
Selected thread is running.
(gdb) bt
Selected thread is running.
^C
(gdb) Quit
(gdb) bt
Selected thread is running.

If we messed up the tpidr* register wouldn't that show up more immediately? Why does running the SYS_futex ruin gdb? Why is gdb so fragile: it should IMHO handle anything user mode does.

derekbruening added a commit that referenced this issue Aug 18, 2020
Fixes a crash on detach on aarchxx by setting xsp to the proper value
for SYS_rt_sigreturn to find the signal frame for resuming app state.

Tested manually on aarch64 on the api.detach test in gdb.  Unfortunately
there are further bugs, so the test cannot be enabled yet.

Issue: #1578
derekbruening added a commit that referenced this issue Sep 10, 2020
Sets os_should_swap_state() to true.
Adds support for os_swap_context() twice in a row with the same state.
Removes os_swap_context_go_native().

The api.detach test now passes on AArch64.

Enables all of the api.* and api.static_* tests for AArchXX, except
api.startstop (crash) and api.static_prepop (hang), which will be fixed
separately.

Fixes a build error in the api.static_crash test: a missing header.

Issue: #1578, #1582
@derekbruening
Contributor Author

Another issue: for an app not using libc (like a pure-asm app: suite/tests/bin/allasm_aarch64_flush) the app TLS pointer is NULL, which causes problems in DR's exit sequence where for non-x86 we need a base value b/c the DR pointer is a secondary pointer at an offset inside the base. My comment in os_switch_seg_to_context() for my fix for now:

        if (os_tls->app_lib_tls_base == NULL) {
            /* XXX i#1578: For pure-asm apps that do not use libc, the app may have no
             * thread register value.  For detach we would like to write a 0 back into
             * the thread register, but it complicates our exit code, which wants access
             * to DR's TLS between dynamo_thread_exit_common()'s call to
             * dynamo_thread_not_under_dynamo() and its call to
             * set_thread_private_dcontext(NULL).  For now we just leave our privlib
             * segment in there.  It seems rather unlikely to cause a problem: app code
             * is unlikely to read the thread register; it's going to assume it owns it
             * and will just blindly write to it.
             */
            return true;
        }

derekbruening added a commit that referenced this issue Sep 30, 2020
Sets os_should_swap_state() to true.
Adds support for os_swap_context() twice in a row with the same state.
Removes os_swap_context_go_native().

For a no-libc app with NULL TLS (like our pure-asm tests), we leave our
TLS pointer in the register on exit to simplify the exit process for now.
This is a minor transparency issue on detach: going to live with it for now.

The api.detach test now passes on AArch64.

Enables all of the api.* and api.static_* tests for AArchXX, except
api.startstop (crash), api.static_noclient (assert), api.thread_churn
(crash), and api.static_prepop (hang), which will be fixed
separately.

Fixes a build error in the api.static_crash test: a missing header.

Issue: #1578, #1582

Co-authored-by: Abhinav Anil Sharma <[email protected]>
@derekbruening
Contributor Author

So we now have api.detach and several other tests passing.

api.static_signal passes most of the time on Jenkins but once it got an extra "Got SIGSEGV" (2 in a row): http://139.178.83.194:8080/job/DynamoRIO-AArch64-Precommit/1711/consoleFull
It also seems to hang on the tx1 packet.net machine, so more work is needed there.

@derekbruening
Contributor Author

PR #4470 for #4468 enabled client.drwrap-test-detach but it seems to still have problems: #4467 (comment)

derekbruening added a commit that referenced this issue Oct 8, 2020
PR #4470 for #4468 enabled client.drwrap-test-detach but it seems to
still have problems: PR #4467 found that it hangs 28 out of 100 times.
We re-disable it here for now.

Issue: #1578
@derekbruening
Contributor Author

derekbruening commented Jan 26, 2021

The status of trying to enable all the api.* and api.static_* start/stop/detach tests as of when I last looked at them, back in October 2020:


  • api.detach now works after the above fixes

199: Test command: /home/derek/dr/build/bin64/runstats "-s" "90" "-killpg" "-silent" "-env" "LD_LIBRARY_PATH" "/home/derek/dr/build/lib64/debug:/home/derek/dr/build/ext/lib64/debug:" "-env" "DYNAMORIO_OPTIONS" "-stderr_mask 0xC -dumpcore_mask 0 -code_api" "/home/derek/dr/build/suite/tests/bin/api.startstop"
199: Test timeout computed to be: 1500
199: <Application /home/derek/dr/build/suite/tests/bin/api.startstop (45351).  DynamoRIO internal crash at PC 0x0000000000000000.  Please report this at http://dynamorio.org/issues/.  Program aborted.
199: Received SIGSEGV at pc 0x0000000000000000 in thread 45352
199: Base: 0x0000ffffb095a000
199: Registers:	eflags=0x0000000080000000
199: version 8.0.18537, custom build
199: -code_api -stderr_mask 12 -stack_size 56K -signal_stack_size 32K -max_elide_jmp 0 -max_elide_call 0 -no_inline_ignored_syscalls -native_exec_default_list '' -no_native_exec_managed_code -no_indcall2direct 
199: 0x0000ffffb07e28d0 0x0000aaaad306a284
199: 0x0000ffffb07e2990 0x0000aaaad306a54c
199: 0x0000ffffb07e29c0 0x0000ffffb09350a0
199: 0x0000ffffb07e29f0 0x0000ffffb08abeac>




  • api.static_signal: once on Jenkins it got an extra "Got SIGSEGV" (2 in a row)


211: <Application /var/lib/jenkins/.jenkins/workspace/DynamoRIO-AArch64-Precommit/build/build_debug-internal-64/suite/tests/bin/api.thread_churn (21138).  DynamoRIO internal crash at PC 0x0000ffffb07f0c2c.  Please report this at http://dynamorio.org/issues/.  Program aborted.
211: Received SIGSEGV at pc 0x0000ffffb07f0c2c in thread 21138

202: api.static_noclient: /var/lib/jenkins/.jenkins/workspace/DynamoRIO-AArch64-Precommit/suite/tests/api/static_noclient.c:89: test_static_decode_before_attach: Assertion `res == 0 && memcmp(&mask, &check_mask, sizeof(mask)) == 0' failed.





derekbruening added a commit that referenced this issue Feb 5, 2021
The api.startstop test was crashing previously, but one of the many
recent fixes seems to have addressed that problem.  It now runs to
completion 1000x in a row on AArch64.

Issue: #1578, #4474
@derekbruening
Contributor Author

Since the base detach test and api.startstop and api.detach_state (GPR step) now work, base detach support is considered complete. The remaining issues have been filed as separate issues. I updated the checklist above to show which issues cover which outstanding tests.

derekbruening added a commit that referenced this issue Feb 5, 2021
Fixes a bug in the api.static_prepop assembly where the link register
is clobbered, leading to an infinite loop on aarchxx.
Enables the test for aarchxx.

Fixes a bug on identifying stopping points on ARM where LSB Thumb
decoration on the two sides of the comparison was not consistent.
Fixes a corresponding bug on going native where the LSB was not set.

Tested manually on ARM.

Issue: #1578, #4717, #4720
Fixes #4717
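
The Thumb LSB inconsistency mentioned in the commit message above can be shown with a short sketch. It assumes the usual ARM convention that bit 0 of a branch target selects Thumb mode and is not part of the instruction address; the function names are hypothetical and are not DynamoRIO's.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Strips the Thumb mode bit so the result is the plain instruction address. */
uintptr_t pc_as_fetch_addr(uintptr_t pc)
{
    return pc & ~(uintptr_t)1;
}

/* Produces a branch target: the LSB is set for Thumb code, clear otherwise. */
uintptr_t pc_as_jump_target(uintptr_t pc, bool thumb)
{
    uintptr_t addr = pc_as_fetch_addr(pc);
    return thumb ? (addr | 1) : addr;
}

/* Compares two PCs with both sides normalized the same way, so that a
 * decorated and an undecorated form of the same address still match. */
bool is_stopping_point(uintptr_t cur_pc, uintptr_t stop_pc)
{
    assert(stop_pc != 0); /* this sketch expects a real stopping point */
    return pc_as_fetch_addr(cur_pc) == pc_as_fetch_addr(stop_pc);
}
```

For example, `is_stopping_point(0x8001, 0x8000)` is true here, whereas a raw `==` on the two values would miss the match. When going native to Thumb code, `pc_as_jump_target(pc, true)` puts the LSB back.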
@derekbruening
Contributor Author

I closed this since detach is working for AArch64, but it still seems to have problems on ARM. #4720 is one of those. Maybe it should be re-opened for 32-bit ARM work. We have not had the resources or machines to get some of the tests that we've ensured work on AArch64 working on ARM.

@derekbruening derekbruening reopened this Feb 5, 2021
@derekbruening derekbruening changed the title from "support detach on ARM" to "support detach on 32-bit ARM (AArch64 finished)" Feb 5, 2021