
is_thread_tls_initialized() fails to distinguish a thread in a new thread group from a fork child #2089

Closed
derekbruening opened this issue Dec 1, 2016 · 7 comments

@derekbruening (Contributor)

This is the problem, the || part:

            return (tid == get_sys_thread_id() ||
                    /* We assume we can safely de-ref the dcontext now */
                    /* The child of a fork will initially come here */
                    os_tls->state.spill_space.dcontext->owning_process ==
                    get_parent_id());

A thread created with CLONE_VM but not CLONE_THREAD ends up with a different thread id but inherits the parent's TLS (for use of the MSR, anyway, where the zeroing of TLS_SEG has no effect; we plan to remove that zeroing in #2088). It passes is_thread_tls_initialized() and then wreaks all kinds of havoc by sharing its parent's dcontext, resulting in strange crashes and hangs.

The plan is to redesign how is_thread_tls_initialized() works from the ground up. Xref #1986 and other cases where is_thread_tls_initialized() is also just not good enough.

@derekbruening (Author)

I can reproduce the bug observed on other apps by building the linux.clone test as PIE, which forces use of the MSR rather than the GDT and results in a crash:

% bin64/drrun -- suite/tests/bin/linux.clone
<Starting application /work/dr/git/build_x64_dbg_tests/suite/tests/bin/linux.clone (14070)>
<Paste into GDB to debug DynamoRIO clients:
set confirm off
add-symbol-file '/work/dr/git/build_x64_dbg_tests/lib64/debug/libdynamorio.so' 0x000055a50c82b9d0
>
<Initial options = -no_dynamic_options -code_api -msgbox_mask 12 -stack_size 56K -max_elide_jmp 0 -max_elide_call 0 -early_inject -emulate_brk -no_inline_ignored_syscalls -no_safe_read_tls_init -native_exec_default_list '' -no_native_exec_managed_code -no_indcall2direct >
<(1+x) Handling our fault in a TRY at 0x000055a50cabc90e>
Segmentation fault

@derekbruening (Author)

The history here is interesting: there are multiple references to avoiding syscalls in is_thread_tls_initialized(), yet we regressed for MSR support.

See get_thread_id():

    /* i#228/PR 494330: making a syscall here is a perf bottleneck since we call
     * this routine in read and recursive locks so use the TLS value instead
     */
    thread_id_t id = get_tls_thread_id();
...
  thread_id_t
  get_tls_thread_id(void)
  {
      ptr_int_t tid; /* can't use thread_id_t since it's 32-bits */
      if (!is_thread_tls_initialized())
          return INVALID_THREAD_ID;

Similarly:

  bool
  is_thread_initialized(void)
  {
  #if defined(UNIX) && defined(HAVE_TLS)
      /* We don't want to pay the get_thread_id() cost on every
       * get_thread_private_dcontext() when we only really need the
       * check for this call here, so we explicitly check.
       */
      if (get_tls_thread_id() != get_sys_thread_id())
          return false;
  #endif
      return (get_thread_private_dcontext() != NULL);
  }

@derekbruening (Author)

Micro-benchmark: call dr_get_current_drcontext() 268 million times in empty client init:

# for i in 0x0fffffff; do for j in -no_safe_read_tls_init -safe_read_tls_init; do echo $i $j; /usr/bin/time bin64/drrun $j -stderr_mask 0 -client_lib "api/bin/libempty.so;$i;" -- ls OUT; done; done
0x0fffffff -no_safe_read_tls_init
iters=268435455
OUT
6.94user 5.81system 0:12.76elapsed 99%CPU (0avgtext+0avgdata 8912maxresident)k
0inputs+8outputs (0major+688minor)pagefaults 0swaps
0x0fffffff -safe_read_tls_init
iters=268435455
OUT
1.89user 0.12system 0:02.01elapsed 99%CPU (0avgtext+0avgdata 8904maxresident)k
8inputs+8outputs (0major+688minor)pagefaults 0swaps

So ~6x faster.

@derekbruening (Author)

Testing plain debug build:

Running an old already-built gtest w/ a lot of code (suid sandbox not
disabled so this doesn't launch a window but it does run several
sub-processes):

# /usr/bin/time bin64/drrun -- /work/chromium/src/out/Release/browser_tests --single-process-tests --gtest_filter=PlatformAppBrowserTest.RunningAppsAreRecorded
-no_safe_read_tls_init
16.45user 11.41system 0:27.97elapsed 99%CPU (0avgtext+0avgdata 68840maxresident)k
-safe_read_tls_init
14.10user 8.86system 0:23.08elapsed 99%CPU (0avgtext+0avgdata 68900maxresident)k

The ratio increases as checklevel is dropped until at -checklevel 0 it's
80% faster:

-checklevel 1 -no_safe_read_tls_init
7.97user 2.23system 0:10.37elapsed 98%CPU (0avgtext+0avgdata 69164maxresident)k
-checklevel 1 -safe_read_tls_init
5.92user 0.13system 0:06.24elapsed 97%CPU (0avgtext+0avgdata 69052maxresident)k

-checklevel 0 -no_safe_read_tls_init
5.45user 1.83system 0:07.46elapsed 97%CPU (0avgtext+0avgdata 69348maxresident)k
-checklevel 0 -safe_read_tls_init
3.82user 0.13system 0:04.14elapsed 95%CPU (0avgtext+0avgdata 69124maxresident)k

Release build shows non-trivial impact as well:

-no_safe_read_tls_init
1.10user 0.31system 0:01.60elapsed 88%CPU (0avgtext+0avgdata 65296maxresident)k
-safe_read_tls_init
0.96user 0.10system 0:01.26elapsed 84%CPU (0avgtext+0avgdata 65240maxresident)k

@derekbruening (Author)

The forthcoming commit message does a good job of summarizing the design of the solution here:

Puts in a new scheme for is_thread_tls_initialized(): rather than reading
thread id fields and comparing to expensive syscalls, and trying to
distinguish different fork and clone children, we instead use a very simple
approach.  We add a magic number field to our TLS and
is_thread_tls_initialized() simply does a safe read of that field: if it
equals our magic number, it's initialized.

On a clone, the parent swaps to a separate private TLS with an invalid
magic field.  The child inherits this and thus starts out correctly
uninitialized, while the parent's syscall gencode still works as it ignores
the magic field.  The parent's TLS is restored first thing in dispatch().
On a fork, nothing is done, and thus the magic number remains and the child
is correctly initialized.

We avoid safe read faults on thread exit by pointing the exiting thread's
TLS at a fake TLS with an invalid magic number.  This is efficient for regular
threads and required for threads missing CLONE_SIGHAND, for which we must
also skip the segment zeroing at thread exit.

On swapping to native, we also put in the private TLS.  On re-take-over, we
add new queries (including using a special "invalid" magic number to
distinguish from a completely unknown thread) and TLS swaps to several
attach points to avoid thinking that a NULL dcontext or
!is_thread_tls_initialized() means an unknown thread.

Generalizes the segment swapping code to correctly swap DR's segment too
instead of what it did before where it took in a segment parameter but then
hardcoded the register to use.

The app using the aux segment is not yet supported: that's #2088/#107.

The new scheme is under the -safe_read_tls_init option, which is now on by
default.  Once we're sure it's stable we'll remove the option and the old
code.

Adds a new test linux.clone-pie, a PIE build of linux.clone, which forces
the use of the MSR for TLS instead of the GDT and results in a crash with
-no_safe_read_tls_init.

Expands the linux.clone* tests to also test a thread without CLONE_SIGHAND.

@derekbruening (Author)

I abandoned the initial plan of setting up the child's TLS in the parent: it was much simpler to install an invalid TLS instead.

The original plan had many complications around native vs DR swaps, init and exit corner cases, 32-bit vs 64-bit, etc.

@derekbruening (Author)

Pasting some notes on complications with thread exit that were solved, kept here in case we need to tweak the solution:

**** DONE HANG even w/ i#2089 fix: non-CLONE_SIGHAND thread exit can't handle TLS safe_read fault
     CLOSED: [2016-12-06 Tue 21:42]
     - State "DONE"       from "TODO"       [2016-12-06 Tue 21:42]

Looks like I mis-diagnosed the hang there too as the thread that held
thread_initexit_lock exited normally:

debug_proxy_server_test.89579.00000000/log.57.90237.html
SYS_exit(60) in thread 90237 of 90237 => cleaning up thread

This still happens.  So i#2089 was only responsible for the "weird thread"
assert.

It's a separate thread group not sharing signal handlers, so a crash after
our signal cleanup won't be reported.

lock by 19219
trylock by 19219
thread exit 19219: 1
thread exit 19219: 2
thread exit 19219: 3
thread exit 19219: 4
thread exit 19219: 5
os_tls_exit 19219: 1
os_tls_exit 19219: 2
os_tls_thread_exit 19219: 1
tls_thread_free 19219: 2
os_tls_thread_exit 19219: 2
os_tls_thread_exit 19219: 3
os_tls_exit 19219: 3
os_tls_exit 19219: 3: unmap of 0x00007f22c68ed000
os_tls_exit 19219: 3: unmap of 0x00007f22c68ed000

(gdb) p *heapmgt
$2 = {
  vmheap = {
    start_addr = 0x7f22c0054000 "",
    end_addr = 0x7f22e0054000 "",

# grep 7f22c68e /proc/19148/maps
7f22c68de000-7f22c68eb000 rw-p 00000000 00:00 0
7f22c68eb000-7f22c68ed000 ---p 00000000 00:00 0
7f22c68ed000-7f22c68ee000 rwxp 00000000 00:00 0
7f22c68ee000-7f22c68f1000 ---p 00000000 00:00 0

Program received signal SIGSEGV, Segmentation fault.
safe_read_tls_magic () at dr/git/src/core/arch/x86/x86.asm:2374
2374            mov      r11, rax                       /* preserve */
(gdb) bt
#0  safe_read_tls_magic () at dr/git/src/core/arch/x86/x86.asm:2374
#1  0x00000000712d64bb in is_thread_tls_initialized () at dr/git/src/core/unix/os.c:1446
#2  0x00000000712d8a60 in get_tls_thread_id () at dr/git/src/core/unix/os.c:2575
#3  0x00000000712d8a3e in get_thread_id () at dr/git/src/core/unix/os.c:2564
#4  0x000000007111d5dc in self_owns_write_lock (rw=0x7f48fd20b700) at dr/git/src/core/utils.c:1326
#5  0x0000000071308cf8 in memcache_lock () at dr/git/src/core/unix/memcache.c:130
#6  0x00000000712e5cc8 in all_memory_areas_lock () at dr/git/src/core/unix/os.c:7361
#7  0x00000000711ca007 in dynamo_vm_areas_lock () at dr/git/src/core/vmareas.c:3504
#8  0x00000000711a6c1d in release_memory_and_update_areas (p=0x7f4903970000 "", size=12288, decommit=false, remove_vm=true)
    at dr/git/src/core/heap.c:1927
#9  0x00000000711a7756 in release_guarded_real_memory (p=0x7f4903970000 "", size=12288, remove_vm=true, guarded=true)
    at dr/git/src/core/heap.c:2099
#10 0x00000000711a973e in heap_munmap_ex (p=0x7f4903971000, size=4096, guarded=true) at dr/git/src/core/heap.c:2359
#11 0x00000000711a996e in heap_munmap (p=0x7f4903971000, size=4096) at dr/git/src/core/heap.c:2376
#12 0x00000000712d765c in os_tls_exit (local_state=0x7f4903971000, other_thread=false) at dr/git/src/core/unix/os.c:2001
#13 0x0000000071093d06 in dynamo_thread_exit_common (dcontext=0x7f48fd780680, id=21361, other_thread=false) at dr/git/src/core/dynamo.c:2512
#14 0x0000000071093e2d in dynamo_thread_exit () at dr/git/src/core/dynamo.c:2547
#15 0x00000000712bcc28 in cat_thread_only () at dr/git/src/core/arch/x86/x86.asm:630
(gdb) c
Continuing.
[Inferior 1 (process 21361) exited with code 02]

There's no handler b/c there's no CLONE_SIGHAND and we cleaned up the
private handler.

It crashes b/c while we pointed at &uninit_tls in os_thread_exit(),
os_tls_exit() calls tls_thread_free().
And in fact that's before the heap_munmap which is where the thread exit
fault was originally seen: confirmed in gdb on pthreads.pthreads.  So I
failed to remove the fault.

linux.clone doesn't hit this b/c it passes CLONE_SIGHAND.

So we need to move the clear later to avoid the fault on every thread exit.
But to be at the very end of thread exit, it would need to be in
dynamo_thread_stack_free_and_exit().  Just for thread exit, we could leave
&uninit_tls in there (or leave the handler in there).  Even for GDT, the
kernel is swapping only one slot, so there should be no problem with not
clearing it when the thread exits.

For detach we want to free any LDT/GDT slot or clear the MSR at the last
moment.  This already happens today:
sig_detach() calls:
    os_thread_not_under_dynamo(dcontext);
    os_tls_thread_exit(dcontext->local_state);

After that it looks like it will not invoke any TLS queries.  The
heap_munmap in os_tls_exit() is invoked by the detaching master thread.

So it seems that os_tls_exit(), os_tls_thread_exit(), and tls_thread_free()
should *not* zero the segment or MSR for -safe_read_tls_init unless
doing_detach.

We solve this for shared handlers for the final thread via a global
last_thread_tls_exited.  We wouldn't remove that b/c the last thread
finishes its exit and goes on to other stuff for process exit (IIRC).

***** DONE add linux.clone-nosig test => made part of linux.clone
      CLOSED: [2016-12-06 Tue 19:29]
      - State "DONE"       from "TODO"       [2016-12-06 Tue 19:29]

***** DONE re-verify no TLS faults in pthread.pthreads
      CLOSED: [2016-12-06 Tue 19:29]
      - State "DONE"       from "TODO"       [2016-12-06 Tue 19:29]

hgreving2304 pushed a commit that referenced this issue Apr 10, 2019
When creating client threads, we still attempted to invalidate the thread's TLS by writing
a null selector to its segment register. In 64-bit mode, this seemed to work most of the
time, presumably because Intel and AMD then zero out the upper 32 bits of the hidden
segment base. Intel describes this in Vol. 3A 3.4.4, although not completely clearly.

This patch applies the same scheme introduced and described in #2089.

Before this patch, the test code_api|client.thread ran into all kinds of assertions,
because the client thread interfered with the parent's TLS. These failures are now gone.

Fixes #3526
Issue: #2089