i#1369: Use synch flush callback to enable drcachesim tracing. #4491
DR translates a fault in the code cache to a fault at the corresponding application address. This is done using ilist reconstruction for the fragment where the fault occurred.

However, this does not work as expected when the DR client changes instrumentation during execution; currently, drcachesim does this to enable tracing after -trace_after_instrs. The reconstructed basic block gets the new instrumentation whereas the one in the code cache has the old one. This causes issues during fault handling.

In the current drcachesim case, it appears as though a meta-instr has faulted because the reconstructed ilist has a meta-instr at the code cache fault pc. This issue may manifest differently if the basic block with the new instrumentation is smaller than the old one (unlike the drcachesim 'meta-instr faulted' case) and the faulting address lies beyond the end of the newly instrumented basic block. In that case we may see an ASSERT_NOT_REACHED because the ilist walk ends before the faulting code cache pc is found in the reconstructed ilist.

In the existing code, drcachesim attempts to avoid this by flushing old fragments using dr_unlink_flush_region after it switches to the tracing instrumentation. However, because that flush is asynchronous, there is a race and the flush does not complete in time.

This PR adds support for a callback in the synchronous dr_flush_region API. The callback is executed after the flush but before the threads are resumed.

Using the dr_flush_region callback to change drcachesim instrumentation ensures that the old instrumentation is not applied after the flush and the new one is not applied before it.

Issue: #1369
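To make the two failure modes described above concrete, here is a minimal toy model in C. It is only a sketch: the `toy_*` types, functions, and sample blocks are hypothetical and are not DynamoRIO's data structures or its translation code. It assumes a fragment can be modeled as a flat list of fixed-length instructions, each either an application instr or a meta-instr, and that translation walks the reconstructed list until it reaches the faulting code cache offset.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical toy model; none of these names exist in DynamoRIO. */
typedef struct {
    int is_meta;    /* 1 for instrumentation, 0 for an application instr */
    size_t app_off; /* application offset (only meaningful for app instrs) */
    size_t len;     /* encoded length in the code cache */
} toy_instr_t;

typedef enum {
    TOY_XL8_OK,         /* fault mapped to an application instr */
    TOY_XL8_META_FAULT, /* the walk landed on a meta-instr */
    TOY_XL8_NOT_FOUND,  /* the walk ended before reaching the fault offset */
} toy_xl8_result_t;

/* Walk the reconstructed ilist, tracking the code cache offset of each
 * instr, and map the faulting cache offset back to an application offset. */
toy_xl8_result_t
toy_translate(const toy_instr_t *ilist, size_t count, size_t fault_cache_off,
              size_t *app_off_out)
{
    size_t cache_off = 0;
    for (size_t i = 0; i < count; i++) {
        if (cache_off == fault_cache_off) {
            if (ilist[i].is_meta)
                return TOY_XL8_META_FAULT;
            *app_off_out = ilist[i].app_off;
            return TOY_XL8_OK;
        }
        cache_off += ilist[i].len;
    }
    return TOY_XL8_NOT_FOUND;
}

/* Case 1: the cached block has no meta-instrs, but reconstruction applies
 * the new (tracing) instrumentation and inserts one at cache offset 4. */
const toy_instr_t toy_old_in_cache[] = { { 0, 0, 4 }, { 0, 4, 4 }, { 0, 8, 4 } };
const toy_instr_t toy_rebuilt_tracing[] = {
    { 0, 0, 4 }, { 1, 0, 4 }, { 0, 4, 4 }, { 0, 8, 4 }
};

/* Case 2: the cached block is larger than the reconstructed one. */
const toy_instr_t toy_large_in_cache[] = {
    { 0, 0, 4 }, { 1, 0, 4 }, { 1, 0, 4 }, { 0, 4, 4 }
};
const toy_instr_t toy_rebuilt_small[] = { { 0, 0, 4 }, { 0, 4, 4 } };
```

Translating a fault at cache offset 4 against `toy_old_in_cache` succeeds, while the same offset against `toy_rebuilt_tracing` lands on a meta-instr. A fault at offset 12 in `toy_large_in_cache` cannot be found at all in `toy_rebuilt_small`, which corresponds to the ilist walk ending early.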
PR in progress... Local run of a proprietary app shows that some threads segfault after the flush completes.
Ah, so the segfaulting threads were redirected to the reset exit stub by DR at [1]. The
The exit stub expects the stolen reg to be set up already, but it doesn't seem to be -- see
After adding Line 1749 in 9bedb9c
Trying to check whether the stolen reg should have been set up already somewhere else, for the [1]: Line 1816 in 9bedb9c
[2]: exit stub emitting code: dynamorio/core/arch/aarch64/emit_utils.c Line 121 in 9bedb9c
Without this, some threads segfault due to incorrect value in the stolen reg.
This is a source compatibility change and will require changes in client source code. The added documentation describes it as such.
While working on some draft changes, the existing TRY_EXCEPT caused an ASSERT failure due to the passed dcontext not being the current thread's. But it seems to have gone away due to changes made since.
This PR seems to resolve the issue. I'll mark the issue fixed in the final commit message.
Manually adjusting
Other options:
Currently, the tests perform reset at a given fragment count. That count is reached well before the child thread is created. When there is just one thread, the complete reset path is not invoked. To fix this, we cannot simply change the -reset_at_fragment_count value, as the ideal value is prone to change without us noticing. Instead, we perform reset at a given thread count. Verified that linux.thread-reset and linux.clone-reset actually crash with a SIGSEGV on AArch64 without the stolen reg restore in core/synch.c.
Thanks, I used this suggestion. I also removed the existing Let me know if you think we should retain the flag.
I remember using it to binary-search which blocks trigger a bug, especially when combined with -steal_reg_at_reset.
This option is still useful for other manual debugging use cases.
Oh okay, better to keep it then. I added it back now.
While having tons of options does make it harder to test and validate all the code paths and I agree that removing unused ones is a good cleanup, I think this option will be useful: in fact, since I'm having such a hard time reproducing #4460 in an isolated test, the mentioned binary-search approach on the bigger app may be my next step (I'll s/ARM/AARCHXX/ for -steal_reg_at_reset), since it went away stealing x28 but was there for x29.
Sorry for the delay -- somehow I thought a review had not yet been requested and it was still in an exploratory phase.
…back. Added dr_flush_region_ex that also invokes the callback. The existing dr_flush_region simply invokes dr_flush_region_ex with a NULL callback.
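The wrapper arrangement and callback ordering described here can be sketched with a small self-contained model. This is not DynamoRIO code: the `toy_*` names, the runtime struct, and the parameter lists are assumptions made for illustration, and the real dr_flush_region_ex signature is not shown in this thread. The model only captures the ordering stated in the PR description: suspend threads, flush, invoke the callback if one was given, then resume.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for runtime state; not a DynamoRIO structure. */
typedef struct {
    bool threads_suspended;
    int cached_fragments;
} toy_runtime_t;

typedef void (*toy_flush_cb_t)(void *user_data);

/* Extended variant: flush, then run the callback while all threads are
 * still suspended, and only then resume them. */
bool
toy_flush_region_ex(toy_runtime_t *rt, toy_flush_cb_t cb, void *user_data)
{
    rt->threads_suspended = true; /* synchronize with all threads */
    rt->cached_fragments = 0;     /* drop every cached fragment */
    if (cb != NULL)
        cb(user_data);            /* nothing can execute or build code here */
    rt->threads_suspended = false;
    return true;
}

/* The original entry point simply forwards with a NULL callback. */
bool
toy_flush_region(toy_runtime_t *rt)
{
    return toy_flush_region_ex(rt, NULL, NULL);
}

/* Hypothetical client state for switching instrumentation modes. */
typedef struct {
    toy_runtime_t *rt;
    bool tracing_enabled;
    bool saw_suspended; /* were threads suspended when the callback ran? */
    int saw_fragments;  /* how many fragments remained at that point? */
} toy_switch_t;

/* Callback that flips the client to its tracing instrumentation. */
void
toy_enable_tracing(void *user_data)
{
    toy_switch_t *sw = (toy_switch_t *)user_data;
    sw->saw_suspended = sw->rt->threads_suspended;
    sw->saw_fragments = sw->rt->cached_fragments;
    sw->tracing_enabled = true;
}
```

Because the mode switch happens inside the callback, no fragment built with the old instrumentation survives the switch and no fragment is built with the new instrumentation before it, which is the property the PR relies on.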
This is to clearly differentiate from reset_at_nth_thread, which checks the existing number of active threads.
I separated out the stolen reg changes in PR #4498. After that is submitted, I will update this PR's branch so that it contains just the synch flush callback related changes.