
i#1369: Use synch flush callback to enable drcachesim tracing. #4491

Merged: 23 commits into master on Oct 30, 2020

Conversation

@abhinav92003 (Contributor)

DR translates a fault in the code cache to a fault at the corresponding application address. This is done using ilist reconstruction for the fragment where the fault occurred.

But this does not work as expected when the DR client changes instrumentation during execution; currently, drcachesim does this to enable tracing after -trace_after_instrs. The reconstructed basic block gets the new instrumentation whereas the one in the code cache has the old one, which causes issues during fault handling.

In the current drcachesim case, it appears as though a meta-instr has faulted, because the reconstructed ilist has a meta-instr at the code cache fault pc. The issue may manifest differently if the basic block with the new instrumentation is smaller than the old one (unlike the drcachesim 'meta-instr faulted' case) and the faulting address lies beyond the end of the new instrumented basic block: we may see an ASSERT_NOT_REACHED because the ilist walk ends before the faulting code cache pc is found in the reconstructed ilist.

In the existing code, drcachesim attempts to avoid this by flushing old fragments with dr_unlink_flush_region after it switches to the tracing instrumentation. However, because that flush is asynchronous, there is a race and the flush may not complete in time.

This PR adds support for a callback in the synchronous dr_flush_region API. The callback is executed after the flush but before the threads are resumed.

Using the dr_flush_region callback to change drcachesim's instrumentation ensures that the old instrumentation is not applied after the flush and the new one is not applied before it.
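
For concreteness, a minimal sketch of the intended client-side usage follows. It assumes the dr_flush_region_ex variant that this PR's commits add; the callback signature, the user_data parameter, and the whole-cache flush arguments are illustrative assumptions rather than the exact API.

    /* Hypothetical client sketch: switch instrumentation inside the flush
     * completion callback so the change is atomic with respect to the
     * code cache.
     */
    static bool tracing_enabled; /* illustrative flag read by the bb event */

    static void
    enable_tracing_callback(void *user_data)
    {
        /* Runs after all fragments are flushed but before the synched
         * threads resume, so no thread can execute stale instrumentation
         * during the switch.
         */
        tracing_enabled = true;
    }

    /* Flush everything so fragments are rebuilt with the new instrumentation. */
    dr_flush_region_ex(NULL, (size_t)-1, enable_tracing_callback,
                       NULL /* user_data */);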

Issue: #1369

@abhinav92003 (Contributor Author)

PR in progress...

A local run of a proprietary app shows that some threads segfault after the flush completes.

@abhinav92003 (Contributor Author) commented Oct 20, 2020

Ah, so the segfaulting threads were redirected to the reset exit stub by DR at [1]. The SIGSEGV occurs in the exit stub shown below (see [2] for DR code that emits this).

Thread xxx received signal SIGSEGV, Segmentation fault.
0x0000000046fc2458 in ?? ()
(gdb) x/50i 0x46fc2458
=> 0x46fc2458:	stp	x0, x1, [x28]
   0x46fc245c:	mov	x0, #0x4c20                	// #19488
   0x46fc2460:	movk	x0, #0xf7f0, lsl #16
   0x46fc2464:	movk	x0, #0xffff, lsl #32
   0x46fc2468:	ldr	x1, [x28, #64]
   0x46fc246c:	br	x1
   ...

The exit stub expects the stolen reg to be set up already, but it doesn't seem to be -- see x28 below.

(gdb) i r 
x0             0xfffffffffffffffc  -4
x1             0xfffff45bcc80      281474781400192
...
x28            0x30000             196608
...
pc             0x46fc2458          0x46fc2458

After adding arch_mcontext_reset_stolen_reg, the SIGSEGV following dr_flush_region goes away. But the current implementation does this only when INTERNAL_OPTION(steal_reg_at_reset) != 0 (zero is the default).

arch_mcontext_reset_stolen_reg(dcontext, mc);

I'm checking whether the stolen reg should already have been set up somewhere else for the steal_reg_at_reset == 0 (default) case.

[1]: translate_from_synchall_to_dispatch

mc->pc = (app_pc)get_reset_exit_stub(dcontext);

[2]: exit stub emitting code:

/* stp x0, x1, [x(stolen), #(offs)] */
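
For illustration, a sketch of the fix shape in translate_from_synchall_to_dispatch, combining the snippets above (the two calls are from [1] and the earlier comment; the #ifdef guard and surrounding simplification are assumptions):

    #ifdef AARCHXX
        /* Restore DR's TLS base into the stolen reg (x28 on AArch64): the
         * reset exit stub's first instruction stores through it
         * (stp x0, x1, [x28]), so a leftover app value there segfaults.
         */
        arch_mcontext_reset_stolen_reg(dcontext, mc);
    #endif
        mc->pc = (app_pc)get_reset_exit_stub(dcontext);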

@derekbruening (Contributor)

steal_reg_at_reset is a diagnostic option we added in the past for identifying and debugging stolen-reg mangling bugs on 32-bit ARM.

Without this, some threads segfault due to an incorrect value in the stolen reg.
This is a source compatibility change and will require changes in client source code. The added documentation describes it as such.
While working on some draft changes, the existing TRY_EXCEPT caused an ASSERT failure because the passed dcontext was not the current thread's. But it seems to have gone away due to changes made since.
@abhinav92003 abhinav92003 marked this pull request as ready for review October 21, 2020 02:45
@abhinav92003 (Contributor Author)

This PR seems to resolve the issue. I'll mark the issue fixed in the final commit message.

@abhinav92003 (Contributor Author)

> The exit stub expects the stolen reg to be set up already, but it doesn't seem to be

linux.thread-reset is capable of reproducing the above segfault-after-reset, but the current reset_at_fragment_count argument used in the test suite doesn't allow it to: as it is, the reset happens well before the child thread (the test has just two threads) starts. The test did fail when I manually adjusted reset_at_fragment_count to make the reset happen while both threads are active.

Manually adjusting reset_at_fragment_count in the test suite doesn't seem like a good idea though, as the ideal value may change in the future without us noticing, and it also seems to differ between A64 and x86. I tried changing the option to reset_at_every_nth_fragment_count with a low value like 10, but that leads to another debug assert failure:

<Application dynamorio/suite/tests/bin/linux.thread (38747).  Internal Error: DynamoRIO debug check failure: Not implemented @dynamorio/core/unix/signal_linux_aarch64.c:53 (0)

@derekbruening (Contributor)

> Manually adjusting reset_at_fragment_count in the test suite doesn't seem like a good idea though, as the ideal value may change in the future without us noticing, and it also seems to differ between A64 and x86.

Other options:

  • -reset_at_nth_thread.
  • We could add an annotation to trigger a reset programmatically from the app. However, annotations aren't implemented for A64 (another missing feature...)
  • We could use a nudge of type reset, requested from a client or possibly a forked helper app.

Currently, the tests perform a reset at a given fragment count. That count is reached well before the child thread is created, and when there's just one thread, the complete reset path is not invoked.

To fix this, we cannot simply change the -reset_at_fragment_count value, as the ideal value is prone to change without us noticing. Instead, we perform the reset at a given thread count.
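
A rough sketch of that thread-count trigger (the option name is from this thread; the helpers and placement are illustrative, not DR's actual code):

    /* Illustrative: in the new-thread path, schedule a full reset once the
     * active-thread count reaches the requested threshold.
     */
    if (DYNAMO_OPTION(reset_at_nth_thread) != 0 &&
        get_num_threads() == DYNAMO_OPTION(reset_at_nth_thread))
        schedule_reset(RESET_ALL);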

Verified that linux.thread-reset and linux.clone-reset actually crash with a SIGSEGV on AArch64 without the stolen reg restore in core/synch.c.
@abhinav92003 (Contributor Author)

> Other options:
>
>   • -reset_at_nth_thread.

Thanks, I used this suggestion.

I also removed the existing reset_at_fragment_count flag, as it seemed to be useful only for the reset tests. Since it was added before the initial open-source import of DR, I cannot verify the context in which it was added. I added a release note for this.

Let me know if you think we should retain the flag.

@derekbruening (Contributor)

> I also removed the existing reset_at_fragment_count flag, as it seemed to be useful only for the reset tests. Since it was added before the initial open-source import of DR, I cannot verify the context in which it was added. I added a release note for this.

I remember using it to binary-search for which blocks trigger a bug, especially when combined with -steal_reg_at_reset.

This option is still useful for other manual debugging use cases.
@abhinav92003 (Contributor Author)

> I remember using it to binary-search for which blocks trigger a bug, especially when combined with -steal_reg_at_reset.

Oh okay, better to keep it then. I added it back now.

@derekbruening (Contributor)

> > I remember using it to binary-search for which blocks trigger a bug, especially when combined with -steal_reg_at_reset.
>
> Oh okay, better to keep it then. I added it back now.

While having tons of options does make it harder to test and validate all the code paths, and I agree that removing unused ones is good cleanup, I think this option will be useful. In fact, since I'm having such a hard time reproducing #4460 in an isolated test, the mentioned binary-search approach on the bigger app may be my next step (I'll s/ARM/AARCHXX/ for -steal_reg_at_reset), since the bug went away when stealing x28 but was there for x29.

@derekbruening (Contributor) left a comment

Sorry for the delay -- somehow I thought a review had not yet been requested and it was still in an exploratory phase.

…back.

Added dr_flush_region_ex that also invokes the callback. The existing dr_flush_region simply invokes dr_flush_region_ex with a NULL callback.
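
As a sketch of that wrapper relationship (the NULL-callback delegation is from the commit message; the exact parameter list is an assumption):

    bool
    dr_flush_region(app_pc start, size_t size)
    {
        /* Keep the old behavior by delegating to the new _ex variant
         * with no completion callback.
         */
        return dr_flush_region_ex(start, size, NULL /* no callback */,
                                  NULL /* user_data */);
    }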
This is to clearly differentiate from reset_at_nth_thread, which checks the existing number of active threads.
@abhinav92003 (Contributor Author)

I separated out the stolen reg changes into PR #4498. After that is submitted, I will update this PR's branch so that it contains just the synch flush callback related changes.

@abhinav92003 abhinav92003 merged commit 893c06c into master Oct 30, 2020
@abhinav92003 abhinav92003 deleted the i1369-drcachesim-synch-flush-callback branch October 30, 2020 07:45
abhinav92003 added a commit that referenced this pull request Dec 2, 2020
Ignore failure in tool.drcachesim.delay-simple test on AArch64 to unblock various PRs. 

Confirmed locally that this test started failing at 893c06c; the failure didn't show up on the Jenkins test suite for PR #4491 somehow.

Issue: #4571
derekbruening pushed a commit that referenced this pull request Dec 2, 2020