Implement subset of arch_prctl syscall to support local-exec TLS of statically linked executables #1137

Closed
wkozaczuk opened this issue May 18, 2021 · 4 comments · Fixed by #1269

Comments

@wkozaczuk
Collaborator

The Golang apps like our golang-pie-example built with newer Golang compiler/linker (at least as of 1.15) no longer run correctly on OSv and crash like so:

OSv v0.55.0-247-g1433c08b
eth0: 192.168.122.15
Booted up in 162.61 ms
Cmdline: /hello.so
syscall(): unimplemented system call 158
Aborted

[backtrace]
0x00000000402178f7 <???+1075935479>
0x000000004046369e <osv::handle_mmap_fault(unsigned long, int, exception_frame*)+30>
0x0000000040342a1a <mmu::vm_fault(unsigned long, exception_frame*)+298>
0x0000000040390d5f <page_fault+143>
0x000000004038fc16 <???+1077476374>
0x0000100000066c9f <???+421023>
0x000000004045f2a5 <???+1078325925>
0x00000000403fbd89 <thread_main_c+41>
0x0000000040390b92 <???+1077480338>

The unimplemented syscall 158 is __NR_arch_prctl, which can be used to set FS_BASE for local-exec TLS. It looks like something changed in the Golang build chain between 1.12 and 1.15: Golang apps built with -buildmode=pie used to work just fine on OSv. It seems that in the past the PIEs produced by Golang communicated with the kernel in two ways: 1) via the SYSCALL instruction and 2) via libc calls for pthreads (OSv still does not support the clone syscall). The newer build chain seems to produce an executable that communicates only via SYSCALL.
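
For reference, this is roughly what such a raw call looks like from user space on Linux x86_64. It is only a minimal sketch: read_fs_base is a hypothetical helper, the ARCH_GET_FS value 0x1003 is copied from the Linux UAPI header asm/prctl.h, and on any other platform the helper simply reports that no value is available.

```cpp
#include <cassert>
#include <optional>

#if defined(__linux__) && defined(__x86_64__)
#include <sys/syscall.h>
#include <unistd.h>
#endif

// Sub-function code from the Linux UAPI header <asm/prctl.h>.
constexpr int kArchGetFs = 0x1003;

// Hypothetical helper: asks the kernel for the calling thread's FS base using
// the raw arch_prctl syscall (number 158 on x86_64), bypassing any libc wrapper.
std::optional<unsigned long> read_fs_base() {
#if defined(__linux__) && defined(__x86_64__)
    unsigned long fs_base = 0;
    if (syscall(SYS_arch_prctl, kArchGetFs, &fs_base) == 0) {
        return fs_base;
    }
#endif
    return std::nullopt;  // not x86_64 Linux, or the call failed
}
```

An executable that talks to the kernel only via SYSCALL would issue the ARCH_SET_FS variant of this call itself during startup, which is the call OSv currently rejects.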

Here is some info about the executable:

file apps/golang-pie-example/hello.so 
apps/golang-pie-example/hello.so: ELF 64-bit LSB shared object, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, Go BuildID=KLddGH5JkygDOUUjd9N8/Bw6msSNbQDLVNmZHbfg9/N-nQvH6Ne9pl5aExWhga/Ncb0ercFgAfBxICpTZba, not stripped

ldd apps/golang-pie-example/hello.so
	statically linked

readelf -l apps/golang-pie-example/hello.so 

Elf file type is DYN (Shared object file)
Entry point 0x465080
There are 12 program headers, starting at offset 64

Program Headers:
  Type           Offset             VirtAddr           PhysAddr
                 FileSiz            MemSiz              Flags  Align
  PHDR           0x0000000000000040 0x0000000000400040 0x0000000000400040
                 0x00000000000002a0 0x00000000000002a0  R      0x1000
  INTERP         0x0000000000000fe4 0x0000000000400fe4 0x0000000000400fe4
                 0x000000000000001c 0x000000000000001c  R      0x1
      [Requesting program interpreter: /lib64/ld-linux-x86-64.so.2]
  NOTE           0x0000000000000f80 0x0000000000400f80 0x0000000000400f80
                 0x0000000000000064 0x0000000000000064  R      0x4
  LOAD           0x0000000000000000 0x0000000000400000 0x0000000000400000
                 0x0000000000099ed0 0x0000000000099ed0  R E    0x1000
  LOAD           0x000000000009a000 0x000000000049a000 0x000000000049a000
                 0x0000000000072ce5 0x0000000000072ce5  R      0x1000
  LOAD           0x000000000010d000 0x000000000050d000 0x000000000050d000
                 0x0000000000086467 0x0000000000086467  RW     0x1000
  GNU_RELRO      0x000000000010d000 0x000000000050d000 0x000000000050d000
                 0x0000000000086467 0x0000000000086467         0x1000
  LOAD           0x0000000000194000 0x0000000000594000 0x0000000000594000
                 0x0000000000015a00 0x00000000000481a8  RW     0x1000
  DYNAMIC        0x0000000000194040 0x0000000000594040 0x0000000000594040
                 0x00000000000000b0 0x00000000000000b0  RW     0x8
  TLS            0x0000000000000000 0x0000000000000000 0x0000000000000000
                 0x0000000000000000 0x0000000000000008  R      0x8
  GNU_STACK      0x0000000000000000 0x0000000000000000 0x0000000000000000
                 0x0000000000000000 0x0000000000000000  RW     0x8
  LOOS+0x5041580 0x0000000000000000 0x0000000000000000 0x0000000000000000
                 0x0000000000000000 0x0000000000000000         0x8

readelf -s apps/golang-pie-example/hello-stripped.so 

Symbol table '.dynsym' contains 1 entry:
   Num:    Value          Size Type    Bind   Vis      Ndx Name
     0: 0000000000000000     0 NOTYPE  LOCAL  DEFAULT  UND 

The info seems to be a bit confusing as it suggests that the executable is position-dependent (the entry point) and is statically linked.
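
To spell out what the two views above disagree on, here is a small sketch that classifies an image from its ELF header type and program header types only. The helper classify_elf is hypothetical; the numeric constants are the standard ELF values (the same ones found in elf.h).

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Constants from the ELF specification (same values as in <elf.h>).
constexpr std::uint16_t kEtExec = 2;    // ET_EXEC: fixed-address executable
constexpr std::uint16_t kEtDyn = 3;     // ET_DYN: shared object or PIE
constexpr std::uint32_t kPtInterp = 3;  // PT_INTERP: requests a dynamic linker

// Hypothetical helper: summarizes what the header type and the list of
// program header types say about how the image expects to be loaded.
std::string classify_elf(std::uint16_t e_type,
                         const std::vector<std::uint32_t>& phdr_types) {
    bool has_interp = false;
    for (std::uint32_t type : phdr_types) {
        if (type == kPtInterp) {
            has_interp = true;
        }
    }
    std::string kind = "other";
    if (e_type == kEtDyn) {
        kind = "position-independent";
    } else if (e_type == kEtExec) {
        kind = "position-dependent";
    }
    return kind + (has_interp ? ", wants interpreter" : ", no interpreter");
}
```

Fed with the readelf data above (type DYN plus an INTERP header), this reports "position-independent, wants interpreter", even though ldd says "statically linked" and the entry point looks like a fixed address.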

I have also experimented with basic C hello world app linked statically and here is some info:

 readelf -l apps/native-static-example/hello

Elf file type is EXEC (Executable file)
Entry point 0x401ce0
There are 8 program headers, starting at offset 64

Program Headers:
  Type           Offset             VirtAddr           PhysAddr
                 FileSiz            MemSiz              Flags  Align
  LOAD           0x0000000000000000 0x0000000000400000 0x0000000000400000
                 0x0000000000000488 0x0000000000000488  R      0x1000
  LOAD           0x0000000000001000 0x0000000000401000 0x0000000000401000
                 0x0000000000081d5d 0x0000000000081d5d  R E    0x1000
  LOAD           0x0000000000083000 0x0000000000483000 0x0000000000483000
                 0x0000000000027038 0x0000000000027038  R      0x1000
  LOAD           0x00000000000ab000 0x00000000004ac000 0x00000000004ac000
                 0x0000000000005310 0x0000000000006b40  RW     0x1000
  NOTE           0x0000000000000200 0x0000000000400200 0x0000000000400200
                 0x0000000000000044 0x0000000000000044  R      0x4
  TLS            0x00000000000ab000 0x00000000004ac000 0x00000000004ac000
                 0x0000000000000020 0x0000000000000060  R      0x8
  GNU_STACK      0x0000000000000000 0x0000000000000000 0x0000000000000000
                 0x0000000000000000 0x0000000000000000  RW     0x10
  GNU_RELRO      0x00000000000ab000 0x00000000004ac000 0x00000000004ac000
                 0x0000000000003000 0x0000000000003000  R      0x1

strace apps/native-static-example/hello
execve("apps/native-static-example/hello", ["apps/native-static-example/hello"], 0x7ffc5ad38b90 /* 44 vars */) = 0
arch_prctl(0x3001 /* ARCH_??? */, 0x7ffedba29df0) = -1 EINVAL (Invalid argument)
brk(NULL)                               = 0x1dfc000
brk(0x1dfcd80)                          = 0x1dfcd80
arch_prctl(ARCH_SET_FS, 0x1dfc380)      = 0
uname({sysname="Linux", nodename="fedora-mbpro", ...}) = 0
readlink("/proc/self/exe", "/home/wkozaczuk/projects/osv-mas"..., 4096) = 68
brk(0x1e1dd80)                          = 0x1e1dd80
brk(0x1e1e000)                          = 0x1e1e000
mprotect(0x4ac000, 12288, PROT_READ)    = 0
fstat(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(0x88, 0x1), ...}) = 0
write(1, "Hello from static C code\n", 25Hello from static C code
) = 25
exit_group(0)                           = ?
+++ exited with 0 +++

So in each case, apps make the arch_prctl syscall which OSv would need to support.

This is not the first time the issue of arch_prctl has been raised. Please see this conversation - https://groups.google.com/g/osv-dev/c/qLfRBRRUIHM - and this one, related to one of the patches attempting to add arch_prctl support - https://groups.google.com/g/osv-dev/c/FPkZzUb5uB4/m/npEg9GqDAgAJ.

Here is what @nyh wrote in the latter one:

In an earlier review in https://groups.google.com/forum/#!msg/osv-dev/PW3bkaVCuMg/340vWUeimdMJ
and also today in another thread, I pointed out the problems with this arch_prctl thing which makes this patch either incomplete or even counter-productive:

  1. After a thread does ARCH_SET_FS, if it gets switched out and then back in, thread::switch_to() will overwrite this setting with a fixed pointer "thread->_tcb". So to support ARCH_SET_FS we need to save and restore this value on context switch (or more efficiently, save it only on ARCH_SET_FS, not on every switch out, and restore on switch in).
  2. After a thread does ARCH_SET_FS, it cannot call any OSv code (or any library function in general). We need to modify the system-call code to restore the standard fs_base (the _tcb) when entering the system call, and restore it back to the thread's chosen address when exiting the system call.
  3. It's even worse if, after a thread does ARCH_SET_FS, it calls a function directly, not through a system call, e.g. malloc(). It is possible (???) that Go is aware of this problem and restores the FS_BASE before calling C functions, but if it doesn't, we're in big trouble.
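
The first problem in the list above can be illustrated with a tiny simulation. This is only a sketch under assumptions: sim_cpu, sim_thread and the app_fs field are made-up names, and a plain integer stands in for the FS base register.

```cpp
#include <cassert>
#include <cstdint>

// Simulated per-CPU register state; a real kernel would write the FS base MSR.
struct sim_cpu {
    std::uint64_t fs_base = 0;
};

// Simulated thread: kernel_tcb mirrors OSv's thread->_tcb, while app_fs is a
// hypothetical field remembering the value passed to ARCH_SET_FS (0 = unset).
struct sim_thread {
    std::uint64_t kernel_tcb;
    std::uint64_t app_fs = 0;
};

// Behaviour described in the list: the switch always installs the kernel TCB,
// so a value set through ARCH_SET_FS is lost after a switch out and back in.
void switch_to_naive(sim_cpu& cpu, const sim_thread& next) {
    cpu.fs_base = next.kernel_tcb;
}

// Suggested fix: the value is saved once at ARCH_SET_FS time and restored on
// every switch-in; threads that never called ARCH_SET_FS get the kernel TCB.
void switch_to_restoring(sim_cpu& cpu, const sim_thread& next) {
    cpu.fs_base = next.app_fs ? next.app_fs : next.kernel_tcb;
}
```

With a thread whose app_fs is 0x7000, switch_to_naive leaves fs_base at the kernel TCB, while switch_to_restoring brings 0x7000 back.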

My sense is that adding support of arch_prctl will be a bit tricky but not impossible if we stick to certain assumptions:

  • we would only support the syscall but not arch_prctl glibc wrapper on purpose
  • we would only support arch_prctl with statically linked executables, both position-dependent and position-independent ones; the key thing is that the SYSCALL instruction would provide us with the necessary trigger points to switch the FS register to point to the correct TLS block for a given thread

So here is a concrete list of things to be done:

  • implement the arch_prctl syscall to handle ARCH_SET_FS and ARCH_GET_FS and store the ARCH_SET_FS parameter in a new field for the current thread
  • modify the syscall assembly to save FS register on entry and restore FS register on exit if there is one set by ARCH_SET_FS for current thread
  • modify the page fault handler and possibly interrupt handler (assembly?) to save FS register on entry and restore FS register on exit if there is one set by ARCH_SET_FS for the current thread (only if app?)
  • possibly change context switch code to save and restore FS register; this may not be necessary as this would be done by the above 2
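
A minimal sketch of the first item in the list above, assuming a hypothetical thread_state struct with an app_fs field. The sub-function codes are the values from the Linux UAPI header asm/prctl.h; the function name and the decision to return 0 from ARCH_GET_FS when nothing was set are simplifications of mine, not the eventual OSv code.

```cpp
#include <cassert>
#include <cerrno>
#include <cstdint>

// Sub-function codes from the Linux UAPI header <asm/prctl.h>.
constexpr int kSetFs = 0x1002;  // ARCH_SET_FS
constexpr int kGetFs = 0x1003;  // ARCH_GET_FS

// Hypothetical per-thread state holding the new field the list asks for.
struct thread_state {
    std::uint64_t app_fs = 0;  // last value passed to ARCH_SET_FS, 0 if none
};

// Sketch of the syscall body. It only records or reports the value; loading
// it into the FS register is left to the syscall exit path.
long handle_arch_prctl(thread_state& t, int code, std::uint64_t arg) {
    switch (code) {
    case kSetFs:
        t.app_fs = arg;
        return 0;
    case kGetFs:
        // arg is a user pointer that receives the stored value.
        *reinterpret_cast<std::uint64_t*>(arg) = t.app_fs;
        return 0;
    default:
        // Unsupported sub-functions fail, like 0x3001 in the strace output.
        return -EINVAL;
    }
}
```

The remaining items in the list are about when the stored value gets loaded into FS, which this sketch deliberately leaves out.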
@nyh
Contributor

nyh commented May 19, 2021

Your idea of supporting arch_prctl only for static executables (which can't call any OSv functions) makes sense.

However, because the issue you created is specific to golang, before we do this I would ask to consider why golang went back to static executables, and whether they really did that switch. If I remember the history correctly, golang started with static executables, but then switched to dynamic executables which use glibc only for minimal things. Why did they switch back - do you know the motivation? Did they switch back? The page https://www.arp242.net/static-go.html suggests that maybe both compilation modes are still supported - but the dynamic version is used only if "cgo" is used. Since OSv is, for all intents and purposes, a C library, maybe it would be fine to force Go programs which are built for OSv to declare that they do, in fact, use a C library - and force the go compiler to generate a dynamic executable?

@wkozaczuk
Collaborator Author

Thanks for the link to that article. Indeed it is still possible to generate dynamically linked Go executables, and very often it is the default if the app uses networking (which non-trivial app does not use networking?). There are some differences between executables generated by the older (~1.12) and newer (~1.15) Go toolchains that affect their support on OSv, and I will open a separate issue to address them. The bottom line though is that golang apps built as static are not supported on OSv, and one of the reasons is the missing arch_prctl syscall.

@wkozaczuk wkozaczuk changed the title Implement subset of arch_prctl syscall to support local-exec TLS of modern Golang applications Implement subset of arch_prctl syscall to support local-exec TLS of statically linked executables May 21, 2021
@wkozaczuk
Collaborator Author

wkozaczuk commented Apr 18, 2023

Over the weekend I have been trying to make some progress on the static ELF support. To that end, I resurrected some of the old commits from the static-elf branch originally written by Pekka and refined some of those to add more functional brk support and some other missing syscalls. In essence, I have been trying to get native-example built as a static executable to run on OSv. And I have been able to get OSv to start the app and handle a good chunk of syscalls up to the point when it calls arch_prctl with ARCH_SET_FS.

Obviously, as was raised in the reviews of Pekka's original patches (8 years ago) and somewhat described in one of my comments above, arch_prctl cannot simply set the FS register to the address passed to it; instead it should save that address somewhere as part of the current app thread info (some new field). However, on exit from this syscall the FS should be set to that user value. From this point on, any time this app thread makes another syscall, an interrupt is taken, or a page fault is triggered, we somehow need to reset FS to the _tcb (the OSv thread descriptor) on entry and switch FS back to the user value (the one saved in arch_prctl) on exit.

Now the problem is how exactly do we accomplish it? Setting the FS to the user value on exit from arch_prctl is easy, but all the subsequent switches from the user FS to the OSv one and back are not. Someone raised the possibility of using swapgs (see https://www.felixcloutier.com/x86/swapgs), which is exactly what Linux does. The swapgs instruction swaps the GS base with the value contained in the MSR at address C0000102H (IA32_KERNEL_GS_BASE). So possibly we could store the same OSv value of FS (=_tcb) in this MSR, but then we would have to update it every time we change FS on each context switch in thread::switch_to(). Then on entry to every interrupt handler, page fault, syscall, etc., we would call swapgs so that GS holds the OSv address of _tcb, and then, in assembly, copy that value from GS to FS using wrmsr and a couple of other instructions and registers, which we would first have to save somewhere.

Now there are at least two problems with this:

  1. Resetting the value of IA32_KERNEL_GS_BASE on every context switch is pretty expensive. It requires wrmsr and adding this line:
diff --git a/arch/x64/arch-switch.hh b/arch/x64/arch-switch.hh
index 3678eafc..d1555575 100644
--- a/arch/x64/arch-switch.hh
+++ b/arch/x64/arch-switch.hh
@@ -81,6 +81,7 @@ void thread::switch_to()
     // barriers
     barrier();
     set_fsbase(reinterpret_cast<u64>(_tcb));
+    processor::wrmsr(msr::IA32_FS_KERNEL_BASE, reinterpret_cast<u64>(_tcb));
     barrier();
     auto c = _detached_state->_cpu;
     old->_state.exception_stack = c->arch.get_exception_stack();

makes the context switch (as measured by misc-ctxsw) go up from 313 to 362. Unfortunately, we cannot use the fsgsbase instructions, which I think would make things faster. But I think this can be optimized: we could make the write conditional on whether the new application FS field (which saves what arch_prctl passed) is set, so maybe this would not be such a big deal.

  2. We cannot blindly use swapgs on every syscall/interrupt/etc. entry and exit. We should ONLY do it if the current thread is an application one with user TLS (arch_prctl was called). Otherwise, in the nested scenario (a syscall interrupted by an interrupt or page fault) we would do it twice and end up using the app value of GS, if any. In other words, we should never use swapgs on a kernel thread, or while already in kernel code (an interrupt or page fault handler) that was triggered from an app thread. Linux has to take special care of when to call swapgs as well (please read this). In essence, it would have this code at the beginning of each entry:
xorl %ebx,%ebx
testl $3,CS+8(%rsp)
je error_kernelspace
SWAPGS

which amounts to (from that article) - "pick this info off the entry frame on the kernel stack, from the CS of the ptregs area of the kernel stack". From what I understand, this code checks the CPL (Current Privilege Level) based on the low 2 bits of the CS register value placed in the exception/interrupt frame. Now, in OSv we cannot apply a similar solution (I might be incorrect), as everything (kernel and app) runs in ring 0 and those 2 low bits in CS would never change, would they (see this)?
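
To make the nesting problem concrete, here is a small simulation. It is a sketch, not kernel code: gs_state, swapgs and guarded_entry are hypothetical names modelling the GS base, the IA32_KERNEL_GS_BASE MSR, and the CS-based check from the assembly above.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>

// Simulated GS state: the active base and the IA32_KERNEL_GS_BASE MSR.
struct gs_state {
    std::uint64_t gs_base;
    std::uint64_t kernel_gs_base;
};

// Models SWAPGS: exchanges the active GS base with IA32_KERNEL_GS_BASE.
void swapgs(gs_state& s) {
    std::swap(s.gs_base, s.kernel_gs_base);
}

// The low two bits of the CS selector saved in the exception frame give the
// privilege level of the interrupted code.
unsigned privilege_level(std::uint16_t cs_selector) {
    return cs_selector & 3u;
}

// Linux-style entry: swap only when the interrupted code ran outside ring 0.
void guarded_entry(gs_state& s, std::uint16_t saved_cs) {
    if (privilege_level(saved_cs) != 0) {
        swapgs(s);
    }
}
```

Calling swapgs twice in a row (a blind nested entry) puts the app value back into gs_base, which is the failure described above. guarded_entry avoids that when user code runs in ring 3, but with a ring-0 selector (0x08 is just an example value) it never swaps at all, so the check gives no usable signal when everything runs in ring 0.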

So this is what I propose instead which is based on the assumption that applications would not be using the GS register or we would not let them:

  1. Make arch_prctl fail ARCH_SET_GS (what kind of apps would it affect, as GS in theory can be used for whatever reason, unlike FS?).
  2. Do not advertise the fsgsbase instruction in /proc/cpuinfo, as fsgsbase could be another way to mess with our GS (see https://www.kernel.org/doc/html/latest/x86/x86_64/fsgs.html).
  3. On every cpu initialization set GSBASE (the GS register) to some new per-cpu structure with at least one field - kernel_tcb - intended to hold the OSv value of the _tcb for the current thread running on this cpu. We would reset this field (but not GSBASE) on every context switch which should be quite cheap.
  4. Do not use swapgs on every entry and exit. Instead, somehow dereference the value of application tcb (if any) from the kernel tcb using the gs:<offset> method. Maybe the app tcb field would have to be saved in the thread_control_block for that reason. And that way reset the FSBASE only if that app tcb field is set. I think it is doable. We would still somehow need to account for possible nesting scenarios.

Any other ideas?

@nyh
Contributor

nyh commented Apr 18, 2023

By the way, regarding GS, https://www.kernel.org/doc/html/latest/x86/x86_64/fsgs.html says that "The GS segment has no common use and can be used freely by applications.". I guess this means we can decide not to support the applications (if there are any) that use it - and it's not used by glibc or the ABI or anything like that.

wkozaczuk added a commit to wkozaczuk/osv that referenced this issue Oct 16, 2023
…utables

Even simplest executables need thread local storage (TLS) and a good example
of it is errno which is a thread local variable. The OSv kernel itself uses
many thread local variables and when running dynamically linked executables
it shares the TLS memory block with the application. In this case OSv fully
controls the setup of TLS and stores the pointer to TCB (Thread Control Block)
as part of a thread state and updates the FS register on every context switch.

On the other hand, statically linked executables set up their own TLS and register
it with the kernel by executing the arch_prctl syscall on x86_64. In order to support
it in OSv, we need to implement the arch_prctl syscall and modify some key
places in the kernel code - syscall handler, exception handlers and VDSO -
to switch from application TLS to the kernel one and back.

The newly implemented arch_prctl syscall on ARCH_SET_FS stores the application
TCB address in the new field app_tcb added to the thread_control_block structure.
At the same time, we modify the following places to support switching between the
application and kernel TCB if necessary (app_tcb != 0):

The exception handler assembly in entry.S is modified to detect if on entry
the current FS register points to the kernel TCB and if not it switches to the kernel
one; on exit from exception, it switches back to the application TCB. To make this
possible we "duplicate" the current thread kernel TCB address and store it in the new
field _current_thread_kernel_tcb of the cpu_arch structure which is updated
on every context switch and can be accessed in assembly as gs:16. The first
8 bytes field self of the thread control block holds the address to itself
so we can easily compare fs:0 with gs:16 to know if FS register points to the
kernel TCB or not. Please note this scheme is simpler and faster than the
original version, which relied on an extra counter and also required interrupts
to be disabled. It also works in nested scenarios - for example a page fault
interrupted during a sleep.
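
The detection logic described in this paragraph can be sketched as a small
simulation. The names fake_tcb, fake_cpu, exception_entry and exception_exit
are hypothetical; the real code is assembly in entry.S, and plain pointers
stand in here for the FS register and the gs:16 slot.

```cpp
#include <cassert>
#include <cstdint>

// Simulated TCB. As described above, the first 8-byte field holds the
// block's own address, so reading fs:0 yields the current FS base.
struct fake_tcb {
    fake_tcb* self;
    fake_tcb() : self(this) {}
};

// Simulated CPU: fs_base stands in for the FS register, and
// current_thread_kernel_tcb for the per-CPU field read as gs:16.
struct fake_cpu {
    fake_tcb* fs_base;
    fake_tcb* current_thread_kernel_tcb;
};

// Entry half: returns the application TCB to restore on exit, or nullptr if
// FS already pointed at the kernel TCB and nothing was switched.
fake_tcb* exception_entry(fake_cpu& cpu) {
    fake_tcb* fs0 = cpu.fs_base->self;           // what "fs:0" reads
    if (fs0 == cpu.current_thread_kernel_tcb) {  // compare with "gs:16"
        return nullptr;
    }
    cpu.fs_base = cpu.current_thread_kernel_tcb;
    return fs0;
}

// Exit half: switch back only if the matching entry performed a switch.
void exception_exit(fake_cpu& cpu, fake_tcb* saved_app_tcb) {
    if (saved_app_tcb) {
        cpu.fs_base = saved_app_tcb;
    }
}
```

In the nested case the inner entry sees FS already pointing at the kernel TCB
and returns nullptr, so only the outermost exit restores the application TCB.
No counter is needed for that.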

Similarly, we also change the syscall handler and VDSO code, where we use a simple
RAII utility - tls_switch - to detect if the current thread has a non-zero application
TCB and if so to switch to the kernel one before the code in scope and switch
back to the application one after. This scheme is a little different from the
exception handler because both syscall and VDSO functions are only executed
on application threads which could have the FS register pointing to the application
TCB for example when running statically linked executables and we do not need to
deal with nesting.
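
A rough sketch of what such an RAII guard could look like, written as a
simulation rather than the actual tls_switch: scoped_kernel_tls, sim_tcb and
g_sim_fs_base are hypothetical names, and a global integer stands in for the
FS base register.

```cpp
#include <cassert>
#include <cstdint>

// Simulated FS base; real code would write the FS base register instead.
static std::uint64_t g_sim_fs_base = 0;

// Hypothetical stand-in for the thread control block described above.
struct sim_tcb {
    std::uint64_t kernel_tcb;   // address of the kernel TCB
    std::uint64_t app_tcb = 0;  // set by ARCH_SET_FS, 0 for dynamically linked apps
};

// RAII guard in the spirit of tls_switch: switch to the kernel TCB for the
// lifetime of the object, then back to the application TCB.
class scoped_kernel_tls {
public:
    explicit scoped_kernel_tls(const sim_tcb& tcb) : _tcb(tcb) {
        if (_tcb.app_tcb) {
            g_sim_fs_base = _tcb.kernel_tcb;
        }
    }
    ~scoped_kernel_tls() {
        if (_tcb.app_tcb) {
            g_sim_fs_base = _tcb.app_tcb;
        }
    }
    scoped_kernel_tls(const scoped_kernel_tls&) = delete;
    scoped_kernel_tls& operator=(const scoped_kernel_tls&) = delete;

private:
    const sim_tcb& _tcb;
};
```

For a thread with app_tcb == 0 both the constructor and the destructor reduce
to a single comparison, which fits the observation that dynamically linked
executables are barely affected.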

In addition, the vdso code is changed to C++ to allow using the C++ RAII utility
described above.

In essence, this PR makes it possible to launch simple statically linked executables
like "Hello World" on OSv:

gcc -static -o hello-static apps/native-example/hello.c

./scripts/run.py -e /hello-static
OSv v0.57.0-74-g2a835078
Booted up in 142.76 ms
Cmdline: /hello-static
WARNING: Statically linked executables are only supported to limited extent!
syscall(): unimplemented system call 218
syscall(): unimplemented system call 273
syscall(): unimplemented system call 334
syscall(): unimplemented system call 302
Hello from C code

Please note that the code changes touch some critical places of the kernel
functionality - context switching, syscall handling, exception handling, and VDSO
implementation - and thus may slightly affect their performance.

As far as context switching goes, this patch adds only a single memory write
operation that does not seem to affect it in any measurable way based on what
the misc-ctxsw.cc indicates.

On the other hand, one could see the syscall handling cost go up by 2-5 ns (3-5% of
the total cost ~100ns based on what misc-syscall-perf.cc measures) when executing statically
linked executables due to the fact we need to switch the fsbase from the app TCB to the kernel
one and back. The good news is that the syscall handling does not seem to be affected
in any significant way when running dynamically linked executables.

The VDSO function calls are affected the most, by 7-10 ns (from 23 to 30ns), even though the VDSO code
uses the same exact tls_switch RAII utility and seems to get inlined in similar way
as above in the syscall handler.

Finally, I did not measure the impact of changes to the exception handling (interrupts, page faults, etc)
but I think it should be similar to syscall handling. The interrupts are in general relatively expensive in
the virtualized environment (guest/host) as this email by Avi Kivity explains - https://groups.google.com/g/osv-dev/c/w_fuxsYla-M/m/WxpRZTXQ-twJ.
On top of this, the FPU saving/restoring takes ~60ns which is far more expensive than switching fsbase.

Fixes cloudius-systems#1137

Signed-off-by: Waldemar Kozaczuk <[email protected]>
wkozaczuk added a commit to wkozaczuk/osv that referenced this issue Oct 24, 2023
wkozaczuk added a commit that referenced this issue Nov 4, 2023