
i#1369: Use synch flush callback to enable drcachesim tracing. #4491

Merged: 23 commits into master on Oct 30, 2020

Conversation

@abhinav92003 (Contributor)

DR translates a fault in the code cache to a fault at the corresponding application address. This is done using ilist reconstruction for the fragment where the fault occurred.

But this does not work as expected when the DR client changes instrumentation during execution; currently, drcachesim does this to enable tracing after -trace_after_instrs. The reconstructed basic block gets the new instrumentation whereas the one in the code cache has the old one, which causes issues during fault handling.

In the current drcachesim case, it appears as though a meta-instr has faulted, because the reconstructed ilist has a meta-instr at the code cache fault pc. The issue may manifest differently if the basic block with the new instrumentation is smaller than the old one (unlike the drcachesim 'meta-instr faulted' case) and the faulting address lies beyond the end of the new instrumented basic block: we may see an ASSERT_NOT_REACHED because the ilist walk ends before the faulting code cache pc is found in the reconstructed ilist.

In the existing code, drcachesim attempts to avoid this by flushing old fragments with dr_unlink_flush_region after it switches to the tracing instrumentation. However, because that flush is asynchronous, there is a race and the flush may not complete in time.

This PR adds support for a callback in the synchronous dr_flush_region API. The callback is executed after the flush but before the threads are resumed.

Using the dr_flush_region callback to change drcachesim's instrumentation ensures that the old instrumentation is not applied after the flush and the new one is not applied before it.
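
For concreteness, a minimal sketch of the intended client-side usage follows. It assumes the dr_flush_region_ex variant that this PR's commits add; the callback signature, the user_data parameter, and the whole-cache flush arguments are illustrative assumptions rather than the exact API.

    /* Hypothetical client sketch: switch instrumentation inside the flush
     * completion callback so the change is atomic with respect to the
     * code cache.
     */
    static bool tracing_enabled; /* illustrative flag read by the bb event */

    static void
    enable_tracing_callback(void *user_data)
    {
        /* Runs after all fragments are flushed but before the synched
         * threads resume, so no thread can execute stale instrumentation
         * during the switch.
         */
        tracing_enabled = true;
    }

    /* Flush everything so fragments are rebuilt with the new instrumentation. */
    dr_flush_region_ex(NULL, (size_t)-1, enable_tracing_callback,
                       NULL /* user_data */);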

Issue: #1369

@abhinav92003 (Contributor Author)

PR in progress...

A local run of a proprietary app shows that some threads segfault after the flush completes.

@abhinav92003 (Contributor Author) commented Oct 20, 2020

Ah, so the segfaulting threads were redirected to the reset exit stub by DR at [1]. The SIGSEGV occurs in the exit stub shown below (see [2] for DR code that emits this).

Thread xxx received signal SIGSEGV, Segmentation fault.
0x0000000046fc2458 in ?? ()
(gdb) x/50i 0x46fc2458
=> 0x46fc2458:	stp	x0, x1, [x28]
   0x46fc245c:	mov	x0, #0x4c20                	// #19488
   0x46fc2460:	movk	x0, #0xf7f0, lsl #16
   0x46fc2464:	movk	x0, #0xffff, lsl #32
   0x46fc2468:	ldr	x1, [x28, #64]
   0x46fc246c:	br	x1
   ...

The exit stub expects the stolen reg to be set up already, but it doesn't seem to be -- see x28 below.

(gdb) i r 
x0             0xfffffffffffffffc  -4
x1             0xfffff45bcc80      281474781400192
...
x28            0x30000             196608
...
pc             0x46fc2458          0x46fc2458

After adding arch_mcontext_reset_stolen_reg, the SIGSEGV following dr_flush_region goes away. But the current implementation does this only when INTERNAL_OPTION(steal_reg_at_reset) != 0 (zero is the default).

arch_mcontext_reset_stolen_reg(dcontext, mc);

I'm checking whether the stolen reg should already have been set up somewhere else for the steal_reg_at_reset == 0 (default) case.

[1]: translate_from_synchall_to_dispatch

mc->pc = (app_pc)get_reset_exit_stub(dcontext);

[2]: exit stub emitting code:

/* stp x0, x1, [x(stolen), #(offs)] */
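
For illustration, a sketch of the fix shape in translate_from_synchall_to_dispatch, combining the snippets above (the two calls are from [1] and the earlier comment; the #ifdef guard and surrounding simplification are assumptions):

    #ifdef AARCHXX
        /* Restore DR's TLS base into the stolen reg (x28 on AArch64): the
         * reset exit stub's first instruction stores through it
         * (stp x0, x1, [x28]), so a leftover app value there segfaults.
         */
        arch_mcontext_reset_stolen_reg(dcontext, mc);
    #endif
        mc->pc = (app_pc)get_reset_exit_stub(dcontext);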

@derekbruening (Contributor)

steal_reg_at_reset is a diagnostic option we added in the past for identifying and debugging stolen-reg mangling bugs on 32-bit ARM.

Without this, some threads segfault due to an incorrect value in the stolen reg.
This is a source compatibility change and will require changes in client source code. The added documentation describes it as such.
While working on some draft changes, the existing TRY_EXCEPT caused an ASSERT failure because the passed dcontext was not the current thread's. But it seems to have gone away due to changes made since.
@abhinav92003 abhinav92003 marked this pull request as ready for review October 21, 2020 02:45
@abhinav92003 (Contributor Author)

This PR seems to resolve the issue. I'll mark the issue fixed in the final commit message.

@abhinav92003 (Contributor Author)

> The exit stub expects the stolen reg to be set up already, but it doesn't seem to be

linux.thread-reset is capable of reproducing the above segfault-after-reset, but the current reset_at_fragment_count argument used in the test suite doesn't allow it to: as it is, the reset happens well before the child thread (the test has just two threads) starts. The test did fail when I manually adjusted reset_at_fragment_count to make the reset happen while both threads are active.

Manually adjusting reset_at_fragment_count in the test suite doesn't seem like a good idea though, as the ideal value may change in the future without us noticing, and it also seems to differ between A64 and x86. I tried changing the option to reset_at_every_nth_fragment_count with a low value like 10, but that leads to another debug assert failure:

<Application dynamorio/suite/tests/bin/linux.thread (38747).  Internal Error: DynamoRIO debug check failure: Not implemented @dynamorio/core/unix/signal_linux_aarch64.c:53 (0)

@derekbruening (Contributor)

> Manually adjusting reset_at_fragment_count in the test suite doesn't seem like a good idea though, as the ideal value may change in the future without us noticing, and it also seems to differ between A64 and x86.

Other options:

  • -reset_at_nth_thread.
  • We could add an annotation to trigger a reset programmatically from the app. However, annotations aren't implemented for A64 (another missing feature...)
  • We could use a nudge of type reset, requested from a client or possibly a forked helper app.

Currently, the tests perform a reset at a given fragment count. That count is reached well before the child thread is created, and when there's just one thread, the complete reset path is not invoked.

To fix this, we cannot simply change the -reset_at_fragment_count value, as the ideal value is prone to change without us noticing. Instead, we perform the reset at a given thread count.
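
A rough sketch of that thread-count trigger (the option name is from this thread; the helpers and placement are illustrative, not DR's actual code):

    /* Illustrative: in the new-thread path, schedule a full reset once the
     * active-thread count reaches the requested threshold.
     */
    if (DYNAMO_OPTION(reset_at_nth_thread) != 0 &&
        get_num_threads() == DYNAMO_OPTION(reset_at_nth_thread))
        schedule_reset(RESET_ALL);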

Verified that linux.thread-reset and linux.clone-reset actually crash with a SIGSEGV on AArch64 without the stolen reg restore in core/synch.c.
@abhinav92003 (Contributor Author)

> Other options:
>
>   • -reset_at_nth_thread.

Thanks, I used this suggestion.

I also removed the existing reset_at_fragment_count flag, as it seemed to be useful only for the reset tests. Since it was added before the initial open-source import of DR, I cannot verify the context in which it was added. I added a release note for this.

Let me know if you think we should retain the flag.

@derekbruening (Contributor)

> I also removed the existing reset_at_fragment_count flag, as it seemed to be useful only for the reset tests. Since it was added before the initial open-source import of DR, I cannot verify the context in which it was added. I added a release note for this.

I remember using it to binary-search for which blocks trigger a bug, especially when combined with -steal_reg_at_reset.

This option is still useful for other manual debugging use cases.
@abhinav92003 (Contributor Author)

> I remember using it to binary-search for which blocks trigger a bug, especially when combined with -steal_reg_at_reset.

Oh okay, better to keep it then. I added it back now.

@derekbruening (Contributor)

> > I remember using it to binary-search for which blocks trigger a bug, especially when combined with -steal_reg_at_reset.
>
> Oh okay, better to keep it then. I added it back now.

While having tons of options does make it harder to test and validate all the code paths, and I agree that removing unused ones is good cleanup, I think this option will be useful. In fact, since I'm having such a hard time reproducing #4460 in an isolated test, the mentioned binary-search approach on the bigger app may be my next step (I'll s/ARM/AARCHXX/ for -steal_reg_at_reset), since the bug went away when stealing x28 but was there for x29.

@derekbruening (Contributor) left a comment

Sorry for the delay -- somehow I thought a review had not yet been requested and it was still in an exploratory phase.

…back.

Added dr_flush_region_ex that also invokes the callback. The existing dr_flush_region simply invokes dr_flush_region_ex with a NULL callback.
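
As a sketch of that wrapper relationship (the NULL-callback delegation is from the commit message; the exact parameter list is an assumption):

    bool
    dr_flush_region(app_pc start, size_t size)
    {
        /* Keep the old behavior by delegating to the new _ex variant
         * with no completion callback.
         */
        return dr_flush_region_ex(start, size, NULL /* no callback */,
                                  NULL /* user_data */);
    }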
This is to clearly differentiate from reset_at_nth_thread, which checks the existing number of active threads.
@abhinav92003 (Contributor Author)

I separated out the stolen reg changes into PR #4498. After that is submitted, I will update this PR's branch so that it contains just the synch flush callback related changes.

@abhinav92003 abhinav92003 merged commit 893c06c into master Oct 30, 2020
@abhinav92003 abhinav92003 deleted the i1369-drcachesim-synch-flush-callback branch October 30, 2020 07:45
abhinav92003 added a commit that referenced this pull request Dec 2, 2020
Ignore failure in tool.drcachesim.delay-simple test on AArch64 to unblock various PRs. 

Confirmed locally that this test started failing at 893c06c; the failure didn't show up on the Jenkins test suite for PR #4491 somehow.

Issue: #4571
derekbruening pushed a commit that referenced this pull request Dec 2, 2020